HySparse: A Hybrid Sparse Attention Architecture with Oracle Token Selection and KV Cache Sharing
Abstract
HySparse is a hybrid sparse attention architecture that interleaves full and sparse attention layers, using the full attention layer's output to guide token selection in the sparse layers and to share its KV cache, improving both efficiency and performance.
This work introduces Hybrid Sparse Attention (HySparse), a new architecture that interleaves each full attention layer with several sparse attention layers. While conceptually simple, HySparse strategically derives each sparse layer's token selection and KV cache directly from the preceding full attention layer. This design resolves two fundamental limitations of prior sparse attention methods. First, conventional approaches typically rely on additional proxies to predict token importance, introducing extra complexity and potentially suboptimal performance. In contrast, HySparse uses the full attention layer as a precise oracle to identify important tokens. Second, existing sparse attention designs often reduce computation without reducing KV cache memory. HySparse enables sparse attention layers to reuse the full attention layer's KV cache, thereby saving both computation and memory. We evaluate HySparse on both a 7B dense model and an 80B MoE model. Across all settings, HySparse consistently outperforms both full attention and hybrid SWA baselines. Notably, in the 80B MoE model with 49 total layers, only 5 layers employ full attention, yet HySparse achieves substantial performance gains while reducing KV cache storage by nearly 10x.
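To make the layer pattern concrete, below is a minimal PyTorch sketch of one full attention layer followed by sparse layers that reuse its KV cache and treat its attention weights as a token-importance oracle. All function names, the per-token importance aggregation, and the top-k selection rule are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of the HySparse layer pattern (illustrative assumptions only).
import torch
import torch.nn.functional as F


def full_attention(q, k, v):
    """Standard softmax attention; also returns the attention weights,
    which serve as an 'oracle' of per-token importance for later sparse layers."""
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    weights = F.softmax(scores, dim=-1)              # (heads, q_len, kv_len)
    return weights @ v, weights


def sparse_attention_from_oracle(q, k, v, importance, top_k):
    """Sparse attention over only the top_k tokens ranked by the importance
    scores inherited from the preceding full attention layer. The same k/v
    tensors are reused, so the sparse layer stores no KV cache of its own."""
    idx = importance.topk(top_k, dim=-1).indices     # indices of important tokens
    k_sel, v_sel = k[:, idx, :], v[:, idx, :]
    scores = q @ k_sel.transpose(-2, -1) / k.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v_sel


# Toy usage: one full layer followed by several sparse layers sharing its KV cache.
heads, q_len, kv_len, d = 4, 8, 128, 64
q = torch.randn(heads, q_len, d)
k = torch.randn(heads, kv_len, d)
v = torch.randn(heads, kv_len, d)

full_out, attn = full_attention(q, k, v)
# Aggregate attention mass over heads and queries as a per-token importance score
# (one plausible heuristic; the paper may define importance differently).
importance = attn.sum(dim=(0, 1))                    # (kv_len,)

for _ in range(3):                                   # e.g. 3 sparse layers per full layer
    q_sparse = torch.randn(heads, q_len, d)          # in a real model: that layer's own queries
    sparse_out = sparse_attention_from_oracle(q_sparse, k, v, importance, top_k=32)
```

In this sketch the memory saving comes from the sparse layers holding no keys or values of their own, and the compute saving from attending to only `top_k` of the `kv_len` cached tokens.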
Community
Efficient LLM Architecture, Sparse Attention, Hybrid Architecture
The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- STILL: Selecting Tokens for Intra-Layer Hybrid Attention to Linearize LLMs (2026)
- HyLRA: Hybrid Layer Reuse Attention for Efficient Long-Context Inference (2026)
- Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction (2026)
- SPLA: Block Sparse Plus Linear Attention for Long Context Modeling (2026)
- BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding (2025)
- Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection (2026)
- Accelerate Speculative Decoding with Sparse Computation in Verification (2025)