DuoAttention Laguna Adapter

This repository contains adapter-only DuoAttention head weights for poolside/Laguna-XS.2. It does not include the Laguna base weights or tokenizer.

DuoAttention reduces long-context KV-cache growth by learning which KV heads need full history and letting the remaining heads keep only a sink window plus recent tokens. This Laguna adapter loads the base model, applies the learned head mask from duo_attention/full_attention_heads.pt, and enables DuoAttention with sink size 64 and recent size 256.
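The retention rule for a streaming head can be sketched in a few lines. This is a minimal illustration, not the adapter's implementation: `streaming_keep_positions` and `cached_tokens_per_layer` are hypothetical helper names, the 8-head mask is made up for the example, and the defaults mirror the sink size of 64 and recent size of 256 stated above.

```python
def streaming_keep_positions(seq_len, sink=64, recent=256):
    """Token positions a streaming head retains: the first `sink` tokens
    plus the last `recent` tokens (hypothetical helper for illustration)."""
    if seq_len <= sink + recent:
        # Short sequences fit entirely inside the sink + recent budget.
        return list(range(seq_len))
    return list(range(sink)) + list(range(seq_len - recent, seq_len))


def cached_tokens_per_layer(seq_len, full_head_mask, sink=64, recent=256):
    """Total cached token slots across the KV heads of one layer.

    `full_head_mask[i]` is True when KV head i keeps the full history.
    """
    streaming_len = len(streaming_keep_positions(seq_len, sink, recent))
    return sum(seq_len if is_full else streaming_len for is_full in full_head_mask)


# Illustrative layer: 8 KV heads, 2 of them full-attention, 4096-token context.
mask = [True, False, False, True, False, False, False, False]
dense = 4096 * len(mask)
duo = cached_tokens_per_layer(4096, mask)
print(f"dense slots: {dense}, duo slots: {duo}, saved: {1 - duo / dense:.1%}")
```

Because a streaming head never holds more than `sink + recent` tokens, its cache stops growing once the context exceeds that budget; only the full-attention heads scale with sequence length.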

Why Use It

Smaller KV cache for long prompts and generation.
Adapter-only distribution, so the base Laguna model remains separate.
Loading with trust_remote_code=True applies the Laguna DuoAttention patch.
Laguna-specific support for gated attention projections and KV-head reordering.

Figures From The DuoAttention Paper

Figure: DuoAttention retrieval and streaming head split

Figure: DuoAttention full and streaming KV-cache pattern

Figure: DuoAttention KV-cache capacity comparison

Paper: DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads.

Laguna Adapter Figure

This figure visualizes the optimized Laguna DuoAttention gating values produced for this adapter.

Figure: Laguna optimized DuoAttention gating values

Laguna Results

We ran a mixed-precision KV-cache benchmark as a Hugging Face Job on poolside/Laguna-XS.2. The baseline accounts for a dense Laguna FP8 KV cache; the DuoAttention path stores retrieval heads as FP8 and streaming heads as packed INT4 with per-group scale and zero-point metadata.
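To make "packed INT4 with per-group scale/zero-point metadata" concrete, here is a minimal pure-Python sketch of asymmetric 4-bit group quantization. It is an illustration under assumptions, not the benchmark's code: the function names are hypothetical, the zero-point is stored as the group minimum, and the group size and metadata width used by the actual job are not stated here.

```python
def quantize_int4_group(values):
    """Asymmetric 4-bit quantization of one group of floats.

    Returns (packed_bytes, scale, zero_point). Each byte holds two 4-bit
    codes, low nibble first. Hypothetical helper for illustration.
    """
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for constant groups
    zero_point = lo                # assumption: zero-point is the group minimum
    codes = [min(15, max(0, round((v - zero_point) / scale))) for v in values]
    if len(codes) % 2:
        codes.append(0)            # pad to an even number of nibbles
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return packed, scale, zero_point


def dequantize_int4_group(packed, scale, zero_point, count):
    """Unpack the nibbles and map the codes back to floats."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return [zero_point + c * scale for c in codes[:count]]


group = [0.0, 0.5, 1.0, 1.5, -1.5, -1.0, -0.5, 0.25]
packed, scale, zero = quantize_int4_group(group)
restored = dequantize_int4_group(packed, scale, zero, len(group))
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(f"{len(packed)} bytes for {len(group)} values, max abs error {max_err:.4f}")
```

The packed codes take half a byte per value, compared with one byte per value for FP8; the scale and zero-point add a fixed cost per group, so larger groups amortize the metadata at the price of coarser quantization.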

Prompt (tokens)	Decode (tokens)	Base KV	Duo KV	KV Reduction
512	1	40.08 MiB	24.03 MiB	40.04%
512	16	41.25 MiB	24.50 MiB	40.61%
512	64	45.00 MiB	26.00 MiB	42.22%
1,024	1	80.08 MiB	40.03 MiB	50.01%
1,024	16	81.25 MiB	40.50 MiB	50.15%
1,024	64	85.00 MiB	42.00 MiB	50.59%
1,462	1	114.30 MiB	53.72 MiB	53.00%
1,462	16	115.47 MiB	54.19 MiB	53.07%
1,462	64	119.22 MiB	55.69 MiB	53.29%
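The KV Reduction column is 100 × (1 − Duo KV / Base KV). As a cross-check, the nine rows above are also consistent with a simple linear model: about 80 KiB per cached token for the base cache, and about 8 MiB fixed plus 32 KiB per cached token for the DuoAttention cache, where cached tokens = prompt + decode. Those constants are fitted to the table for this sketch; they are not taken from the benchmark code, and the function names are hypothetical.

```python
KIB_PER_MIB = 1024


def base_kv_mib(prompt, decode, kib_per_token=80):
    """Dense cache size in MiB; 80 KiB/token is a fit to the table above."""
    return (prompt + decode) * kib_per_token / KIB_PER_MIB


def duo_kv_mib(prompt, decode, fixed_mib=8, kib_per_token=32):
    """DuoAttention cache size in MiB; both constants are fitted, not measured."""
    return fixed_mib + (prompt + decode) * kib_per_token / KIB_PER_MIB


def reduction_percent(base_mib, duo_mib):
    """Same formula as the KV Reduction column."""
    return 100 * (1 - duo_mib / base_mib)


for prompt in (512, 1024, 1462):
    for decode in (1, 16, 64):
        base = base_kv_mib(prompt, decode)
        duo = duo_kv_mib(prompt, decode)
        print(f"{prompt:>5} {decode:>3} {base:7.2f} MiB {duo:6.2f} MiB "
              f"{reduction_percent(base, duo):6.2f}%")
```

Under this fit, the reduction approaches 1 − 32/80 = 60% as the context grows, because the fixed 8 MiB term matters less at longer lengths. That limit is a property of the fitted model, not a measured result.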

Figure: Laguna DuoAttention mixed KV cache reduction

Job: 6a1ab49e5c8d10ffa11088c0

W&B run: ox2c0m6s

Laguna-Specific Changes

Ported DuoAttention from Llama/Mistral-style attention modules to Laguna's gated attention structure.
Preserved Laguna's g_proj gated output path when splitting full-context and streaming heads.
Reordered Laguna Q/K/V/gating/output projections so full and streaming KV heads remain aligned after patching.
Added adapter-only loading that fetches the base Laguna model separately and applies the learned full_attention_heads tensor at load time.
Kept decode compatible with the patched tuple KV cache path used by the current Laguna remote code.
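The head reordering described above can be sketched as an index permutation. This is an illustrative sketch rather than the adapter's code: the helper names are hypothetical, it assumes a grouped-query layout in which query head j reads KV head j // queries_per_kv, and it assumes projection weights store heads contiguously along the head dimension.

```python
def full_first_permutation(full_head_mask):
    """Stable ordering of KV head indices: full-attention heads first,
    streaming heads after, original order kept within each group."""
    full = [i for i, is_full in enumerate(full_head_mask) if is_full]
    streaming = [i for i, is_full in enumerate(full_head_mask) if not is_full]
    return full + streaming


def kv_to_query_permutation(kv_perm, queries_per_kv):
    """Expand a KV-head permutation to the query heads sharing each KV head,
    so Q, gate, and output projections stay aligned with K and V."""
    return [kv * queries_per_kv + q for kv in kv_perm for q in range(queries_per_kv)]


def head_rows(head_perm, head_dim):
    """Weight indices (heads * head_dim entries) after reordering whole heads."""
    return [h * head_dim + d for h in head_perm for d in range(head_dim)]


mask = [False, True, False, True]        # illustrative: 4 KV heads, 2 full
kv_perm = full_first_permutation(mask)   # [1, 3, 0, 2]
q_perm = kv_to_query_permutation(kv_perm, queries_per_kv=2)
print("KV order:", kv_perm)
print("Q order: ", q_perm)
print("K/V rows:", head_rows(kv_perm, head_dim=2))
```

Applying the same head order to the K and V projections, and the expanded order to the Q, gating, and output projections, leaves the attention output unchanged while making the full and streaming heads contiguous, so each group can use its own cache layout.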

Usage

Install optional tokenizer dependencies if needed:

pip install sentencepiece tiktoken

Load the base tokenizer and compare the base Laguna cache with the DuoAttention cache on the same non-trivial prompt:

import gc
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

adapter_repo = "dogeplusplus/duo-laguna-adapter"
base_model_id = "poolside/Laguna-XS.2"

tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    trust_remote_code=True,
    token=True,
)
model_kwargs = {
    "trust_remote_code": True,
    "token": True,
}
if torch.cuda.is_available():
    model_kwargs["dtype"] = torch.bfloat16
    model_kwargs["device_map"] = {"": "cuda:0"}
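else:
    # Use the same `dtype` keyword as the CUDA branch above; older
    # Transformers releases name this argument `torch_dtype`.
    model_kwargs["dtype"] = "auto"
    model_kwargs["device_map"] = "auto"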


def cache_nbytes(value):
    if value is None:
        return 0
    if torch.is_tensor(value):
        return value.numel() * value.element_size()
    if hasattr(value, "key_cache") and hasattr(value, "value_cache"):
        return cache_nbytes(value.key_cache) + cache_nbytes(value.value_cache)
    if hasattr(value, "to_legacy_cache"):
        try:
            return cache_nbytes(value.to_legacy_cache())
        except Exception:
            pass
    if isinstance(value, dict):
        return sum(cache_nbytes(v) for v in value.values())
    if isinstance(value, (list, tuple)):
        return sum(cache_nbytes(v) for v in value)
    return 0


def first_parameter_device(model):
    return next(model.parameters()).device


def dense_kv_cache_nbytes(config, tokens, dtype):
    num_layers = config.num_hidden_layers
    num_key_value_heads = config.num_key_value_heads
    head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
    bytes_per_value = torch.empty((), dtype=dtype).element_size()
    return num_layers * 2 * num_key_value_heads * tokens * head_dim * bytes_per_value


def clear_cuda():
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()


def greedy_decode_from_prefill(model, prefill, input_ids, max_new_tokens):
    past_key_values = prefill.past_key_values
    next_token = prefill.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated = [input_ids, next_token]
    for _ in range(max_new_tokens - 1):
        out = model(
            input_ids=next_token,
            past_key_values=past_key_values,
            use_cache=True,
        )
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_token)
    return torch.cat(generated, dim=-1)


prompt = (
    "Remember this retrieval key: RIVER-4821. "
    + "The notebook contains many irrelevant meeting notes. " * 180
    + "Question: what is the retrieval key?"
)

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    **model_kwargs,
).eval()
inputs = tokenizer(prompt, return_tensors="pt").to(first_parameter_device(base_model))
with torch.no_grad():
    base_out = base_model(**inputs, use_cache=True)
base_cache_bytes = cache_nbytes(base_out.past_key_values)
if base_cache_bytes == 0:
    base_cache_bytes = dense_kv_cache_nbytes(
        base_model.config,
        inputs["input_ids"].shape[-1],
        next(base_model.parameters()).dtype,
    )
base_cache_mib = base_cache_bytes / 2**20
del base_out, inputs, base_model
clear_cuda()

duo_model = AutoModelForCausalLM.from_pretrained(
    adapter_repo,
    **model_kwargs,
).eval()
duo_inputs = tokenizer(prompt, return_tensors="pt").to(first_parameter_device(duo_model))
with torch.no_grad():
    duo_out = duo_model(**duo_inputs, use_cache=True)
    # Measure the prefill cache before decoding so the comparison with the
    # base model stays like-for-like even if the cache object grows in place.
    duo_cache_mib = cache_nbytes(duo_out.past_key_values) / 2**20
    generated = greedy_decode_from_prefill(duo_model, duo_out, duo_inputs["input_ids"], 64)

print(f"Base KV cache: {base_cache_mib:.2f} MiB")
print(f"Duo KV cache:  {duo_cache_mib:.2f} MiB")
print(f"KV reduction:  {100 * (1 - duo_cache_mib / base_cache_mib):.1f}%")

print(tokenizer.decode(generated[0], skip_special_tokens=True))

Use token=True after hf auth login, or pass a token string directly for private or gated repositories.

Paper: DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads (arXiv:2410.10819, published Oct 14, 2024)