DuoAttention Laguna Adapter
This repository contains adapter-only DuoAttention head weights for
poolside/Laguna-XS.2. It does not include the Laguna base weights or tokenizer.
DuoAttention reduces long-context KV-cache growth by learning which KV heads
need full history and letting the remaining heads keep only a sink window plus
recent tokens. This Laguna adapter loads the base model, applies the learned
head mask from duo_attention/full_attention_heads.pt, and enables DuoAttention
with sink size 64 and recent size 256.
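The retention rule described above can be sketched in a few lines. This is an illustration only, not the adapter's code: the function names are hypothetical, and the defaults simply reuse the sink size (64) and recent size (256) quoted above.

```python
def streaming_kept_positions(seq_len, sink_size=64, recent_size=256):
    """Token positions a streaming head still caches after seq_len tokens.

    Hypothetical helper: keeps the first `sink_size` tokens plus the last
    `recent_size` tokens, and everything when the sequence is still short.
    """
    if seq_len <= sink_size + recent_size:
        return list(range(seq_len))
    sink = list(range(sink_size))
    recent = list(range(seq_len - recent_size, seq_len))
    return sink + recent


def cached_tokens_per_head(seq_len, is_full_head, sink_size=64, recent_size=256):
    """Number of cached tokens for one KV head."""
    if is_full_head:
        # Retrieval (full-attention) heads keep the whole history.
        return seq_len
    # Streaming heads are capped at sink + recent tokens.
    return min(seq_len, sink_size + recent_size)


if __name__ == "__main__":
    for n in (100, 512, 1462):
        print(n, cached_tokens_per_head(n, True), cached_tokens_per_head(n, False))
```

With these defaults a streaming head never holds more than 320 tokens, so its share of the cache stops growing once the prompt is longer than that.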
Why Use It
- Smaller KV cache for long prompts and generation.
- Adapter-only distribution, so the base Laguna model remains separate.
- `trust_remote_code=True` loading applies the Laguna DuoAttention patch.
- Laguna-specific support for gated attention projections and KV-head reordering.
Figures From The DuoAttention Paper
Paper: DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads.
Laguna Adapter Figure
This figure visualizes the optimized Laguna DuoAttention gating values produced for this adapter.
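In the DuoAttention paper, continuous gate values like these are turned into a binary retrieval/streaming decision by thresholding, with the highest-valued heads kept as retrieval heads. The sketch below shows that idea on a toy nested list. The function name, the `retrieval_fraction` parameter, and the example values are assumptions for illustration; the threshold actually used for this adapter is not stated here.

```python
def binarize_gates(gates, retrieval_fraction):
    """Mark the top `retrieval_fraction` of heads (by gate value) as retrieval heads.

    gates: list of per-layer lists of gate values in [0, 1] (toy layout, assumed).
    Returns a same-shaped nested list of booleans; True means full attention.
    """
    flat = sorted((v for layer in gates for v in layer), reverse=True)
    keep = int(round(len(flat) * retrieval_fraction))
    if keep == 0:
        return [[False] * len(layer) for layer in gates]
    # Smallest gate value that still counts as a retrieval head.
    # Ties at the threshold are all kept, so slightly more heads may pass.
    threshold = flat[keep - 1]
    return [[v >= threshold for v in layer] for layer in gates]


if __name__ == "__main__":
    toy_gates = [[0.9, 0.1, 0.4, 0.8], [0.2, 0.95, 0.3, 0.05]]
    print(binarize_gates(toy_gates, retrieval_fraction=0.5))
```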
Laguna Results
We ran a mixed-precision KV benchmark as a Hugging Face Job on
poolside/Laguna-XS.2. The baseline is a dense Laguna KV cache stored in FP8;
the DuoAttention path stores retrieval heads as FP8 and streaming heads as
packed INT4 with per-group scale/zero-point metadata.
| Prompt tokens | Decode tokens | Base KV | Duo KV | KV reduction |
|---|---|---|---|---|
| 512 | 1 | 40.08 MiB | 24.03 MiB | 40.04% |
| 512 | 16 | 41.25 MiB | 24.50 MiB | 40.61% |
| 512 | 64 | 45.00 MiB | 26.00 MiB | 42.22% |
| 1,024 | 1 | 80.08 MiB | 40.03 MiB | 50.01% |
| 1,024 | 16 | 81.25 MiB | 40.50 MiB | 50.15% |
| 1,024 | 64 | 85.00 MiB | 42.00 MiB | 50.59% |
| 1,462 | 1 | 114.30 MiB | 53.72 MiB | 53.00% |
| 1,462 | 16 | 115.47 MiB | 54.19 MiB | 53.07% |
| 1,462 | 64 | 119.22 MiB | 55.69 MiB | 53.29% |
W&B run: ox2c0m6s
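The reduction column can be recomputed from the two cache sizes as `1 - duo / base`. The script below does that for every row, and also checks a simple linear fit that happens to reproduce the table: about 80 KiB per token for the base cache, and about 8 MiB plus 32 KiB per token for the DuoAttention cache, with tokens counted as prompt plus decode. Those three constants are inferred from the numbers above, not stated by the benchmark, so treat them as an empirical reading of the table rather than a description of the implementation.

```python
ROWS = [
    # (prompt_tokens, decode_tokens, base_mib, duo_mib, reported_reduction_pct)
    (512, 1, 40.08, 24.03, 40.04),
    (512, 16, 41.25, 24.50, 40.61),
    (512, 64, 45.00, 26.00, 42.22),
    (1024, 1, 80.08, 40.03, 50.01),
    (1024, 16, 81.25, 40.50, 50.15),
    (1024, 64, 85.00, 42.00, 50.59),
    (1462, 1, 114.30, 53.72, 53.00),
    (1462, 16, 115.47, 54.19, 53.07),
    (1462, 64, 119.22, 55.69, 53.29),
]


def reduction_pct(base_mib, duo_mib):
    """KV reduction as a percentage of the base cache size."""
    return 100.0 * (1.0 - duo_mib / base_mib)


def fitted_base_mib(tokens):
    """Assumed fit: 80 KiB of base KV cache per token."""
    return tokens * 80 / 1024


def fitted_duo_mib(tokens):
    """Assumed fit: 8 MiB fixed cost plus 32 KiB per token."""
    return 8.0 + tokens * 32 / 1024


if __name__ == "__main__":
    for prompt, decode, base, duo, reported in ROWS:
        tokens = prompt + decode
        print(
            f"{prompt:>5} {decode:>3} "
            f"reduction={reduction_pct(base, duo):6.2f}% (reported {reported:.2f}%) "
            f"fit_base={fitted_base_mib(tokens):7.2f} fit_duo={fitted_duo_mib(tokens):6.2f}"
        )
```

Under this reading, the fixed term is what the capped streaming heads cost, and the per-token term is what the retrieval heads add, which is why the reduction keeps improving as the prompt grows.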
Laguna-Specific Changes
- Ported DuoAttention from Llama/Mistral-style attention modules to Laguna's gated attention structure.
- Preserved Laguna's `g_proj` gated output path when splitting full-context and streaming heads.
- Reordered Laguna Q/K/V/gating/output projections so full and streaming KV heads remain aligned after patching.
- Added adapter-only loading that fetches the base Laguna model separately and applies the learned `full_attention_heads` tensor at load time.
- Kept decode compatible with the patched tuple KV cache path used by the current Laguna remote code.
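The head reordering mentioned above amounts to computing one permutation from the head mask and applying it to every projection that is laid out head by head. The sketch below shows the idea on plain Python lists. It is not the adapter's patch code: the function names and the row layout (`head_dim` consecutive rows per KV head) are assumptions for illustration.

```python
def full_first_permutation(full_mask):
    """Order head indices so full-attention heads come first.

    Relative order inside each group is preserved. Returns the permutation
    and the number of full-attention heads.
    """
    full = [i for i, is_full in enumerate(full_mask) if is_full]
    streaming = [i for i, is_full in enumerate(full_mask) if not is_full]
    return full + streaming, len(full)


def reorder_head_rows(rows, perm, head_dim):
    """Reorder weight rows stored head by head (head_dim rows per head)."""
    out = []
    for head in perm:
        out.extend(rows[head * head_dim:(head + 1) * head_dim])
    return out


if __name__ == "__main__":
    mask = [False, True, False, True]  # toy mask: heads 1 and 3 are full-attention
    perm, n_full = full_first_permutation(mask)
    k_rows = [f"k_h{h}_r{r}" for h in range(4) for r in range(2)]
    v_rows = [f"v_h{h}_r{r}" for h in range(4) for r in range(2)]
    # Applying the same permutation to K and V keeps each head's pair aligned.
    print(perm, n_full)
    print(reorder_head_rows(k_rows, perm, head_dim=2))
    print(reorder_head_rows(v_rows, perm, head_dim=2))
```

After such a reordering, the first `n_full` heads can be sliced off as the full-context group and the rest as the streaming group, without per-head index lookups at decode time.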
Usage
Install optional tokenizer dependencies if needed:
```bash
pip install sentencepiece tiktoken
```
Load the base tokenizer and compare the base Laguna cache with the DuoAttention cache on the same non-trivial prompt:
```python
import gc

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

adapter_repo = "dogeplusplus/duo-laguna-adapter"
base_model_id = "poolside/Laguna-XS.2"

# The adapter repo ships no tokenizer, so load it from the base model.
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    trust_remote_code=True,
    token=True,
)

model_kwargs = {
    "trust_remote_code": True,
    "token": True,
}
if torch.cuda.is_available():
    model_kwargs["dtype"] = torch.bfloat16
    model_kwargs["device_map"] = {"": "cuda:0"}
else:
    model_kwargs["torch_dtype"] = "auto"
    model_kwargs["device_map"] = "auto"


def cache_nbytes(value):
    """Recursively sum the bytes of every tensor inside a KV cache object."""
    if value is None:
        return 0
    if torch.is_tensor(value):
        return value.numel() * value.element_size()
    if hasattr(value, "key_cache") and hasattr(value, "value_cache"):
        return cache_nbytes(value.key_cache) + cache_nbytes(value.value_cache)
    if hasattr(value, "to_legacy_cache"):
        try:
            return cache_nbytes(value.to_legacy_cache())
        except Exception:
            pass
    if isinstance(value, dict):
        return sum(cache_nbytes(v) for v in value.values())
    if isinstance(value, (list, tuple)):
        return sum(cache_nbytes(v) for v in value)
    return 0


def first_parameter_device(model):
    return next(model.parameters()).device


def dense_kv_cache_nbytes(config, tokens, dtype):
    """Analytic size of a dense KV cache, used when the cache cannot be inspected."""
    num_layers = config.num_hidden_layers
    num_key_value_heads = config.num_key_value_heads
    head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
    bytes_per_value = torch.empty((), dtype=dtype).element_size()
    # Factor 2 covers keys and values.
    return num_layers * 2 * num_key_value_heads * tokens * head_dim * bytes_per_value


def clear_cuda():
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()


def greedy_decode_from_prefill(model, prefill, input_ids, max_new_tokens):
    """Greedy decoding that reuses the KV cache returned by the prefill pass."""
    past_key_values = prefill.past_key_values
    next_token = prefill.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated = [input_ids, next_token]
    for _ in range(max_new_tokens - 1):
        out = model(
            input_ids=next_token,
            past_key_values=past_key_values,
            use_cache=True,
        )
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_token)
    return torch.cat(generated, dim=-1)


prompt = (
    "Remember this retrieval key: RIVER-4821. "
    + "The notebook contains many irrelevant meeting notes. " * 180
    + "Question: what is the retrieval key?"
)

# 1) Base model: prefill only, then measure the dense KV cache.
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    **model_kwargs,
).eval()
inputs = tokenizer(prompt, return_tensors="pt").to(first_parameter_device(base_model))
with torch.no_grad():
    base_out = base_model(**inputs, use_cache=True)

base_cache_bytes = cache_nbytes(base_out.past_key_values)
if base_cache_bytes == 0:
    # Fall back to the analytic estimate if the cache type is not recognized.
    base_cache_bytes = dense_kv_cache_nbytes(
        base_model.config,
        inputs["input_ids"].shape[-1],
        next(base_model.parameters()).dtype,
    )
base_cache_mib = base_cache_bytes / 2**20

# Free the base model before loading the adapter to avoid holding both in memory.
del base_out, inputs, base_model
clear_cuda()

# 2) DuoAttention adapter: same prompt, same measurement.
duo_model = AutoModelForCausalLM.from_pretrained(
    adapter_repo,
    **model_kwargs,
).eval()
duo_inputs = tokenizer(prompt, return_tensors="pt").to(first_parameter_device(duo_model))
with torch.no_grad():
    duo_out = duo_model(**duo_inputs, use_cache=True)
    # Measure right after prefill so both numbers cover the same tokens,
    # even if decoding later grows the cache object in place.
    duo_cache_mib = cache_nbytes(duo_out.past_key_values) / 2**20
    generated = greedy_decode_from_prefill(duo_model, duo_out, duo_inputs["input_ids"], 64)

print(f"Base KV cache: {base_cache_mib:.2f} MiB")
print(f"Duo KV cache: {duo_cache_mib:.2f} MiB")
print(f"KV reduction: {100 * (1 - duo_cache_mib / base_cache_mib):.1f}%")
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```
Use token=True after hf auth login, or pass a token string directly for
private or gated repositories.
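If you prefer not to hard-code a token string, one option is to resolve it at runtime. The helper below is a hypothetical convenience, not part of this repository: it returns an explicit token if given, otherwise the `HF_TOKEN` environment variable, and otherwise `True` so the library falls back to the token saved at login.

```python
import os


def resolve_hf_token(explicit_token=None):
    """Pick a value for the `token=` argument of `from_pretrained`.

    Hypothetical helper. Order: explicit string, then the HF_TOKEN
    environment variable, then True (use the locally saved login token).
    """
    if explicit_token:
        return explicit_token
    return os.environ.get("HF_TOKEN") or True


if __name__ == "__main__":
    token = resolve_hf_token()
    # Avoid printing the secret itself; only report which source was used.
    print("using saved login" if token is True else "using token string")
```

The result can be passed in place of `token=True` in the calls above, for example `AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True, token=resolve_hf_token())`.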