Redesign Mixture-of-Experts Routers with Manifold Power Iteration Paper • 2606.12397 • Published 5 days ago • 84
Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It Paper • 2606.11052 • Published 6 days ago • 15
FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention Paper • 2606.09079 • Published 7 days ago • 61
Less is More: Recursive Reasoning with Tiny Networks Paper • 2510.04871 • Published Oct 6, 2025 • 516 • 43
Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding Paper • 2605.29707 • Published 18 days ago • 145
NITP: Next Implicit Token Prediction for LLM Pre-training Paper • 2605.24956 • Published 22 days ago • 35
LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws Paper • 2605.23901 • Published 24 days ago • 13
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information Paper • 2605.11609 • Published May 12 • 195
HRM-Text: Efficient Pretraining Beyond Scaling Paper • 2605.20613 • Published 26 days ago • 315
Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention Paper • 2605.22791 • Published 25 days ago • 31