S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models
Abstract
S0 tuning optimizes the initial recurrent state matrices of hybrid language models, outperforming LoRA with zero inference overhead and enabling efficient task switching.
We introduce S0 tuning, a method that optimizes one initial state matrix per recurrent layer of a hybrid language model while freezing all model weights, adding zero inference overhead. Trained on roughly 48 execution-verified HumanEval solutions, S0 tuning outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval. On Qwen3.5-4B (GatedDeltaNet hybrid), it improves greedy pass@1 by +23.6 ± 1.7 pp (10 seeds). On FalconH1-7B (Mamba-2 hybrid), S0 tuning reaches 71.8% ± 1.3 and LoRA reaches 71.4% ± 2.4 (3 seeds), statistically indistinguishable at this sample size, while requiring no weight merging. Cross-domain transfer is significant on MATH-500 (+4.8 pp, p = 0.00002, 8 seeds) and GSM8K (+2.8 pp, p = 0.0003, 10 seeds); a text-to-SQL benchmark (Spider) shows no transfer, consistent with the trajectory-steering mechanism. A prefix-tuning control on a pure Transformer (Qwen2.5-3B) degrades performance by -13.9 pp under all nine configurations tested. On Qwen3.5, a per-step state-offset variant reaches +27.1 pp, above both S0 tuning and LoRA, but at a per-step inference cost. Taken together, these results show that recurrent state initialization is a strong zero-inference-overhead PEFT surface for hybrid language models when verified supervision is scarce. The tuned state is a ~48 MB file; task switching requires no weight merging or model reload. Code and library: https://github.com/jackyoung27/s0-tuning.
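As a toy illustration of the core idea (not the authors' implementation), the sketch below trains only the initial state S0 of a frozen linear-recurrent layer. The recurrence S_t = decay * S_{t-1} + k_t v_t^T, the dimensions, the loss, and the finite-difference optimizer are all illustrative assumptions standing in for the real GatedDeltaNet/Mamba-2 layers and autograd training.

```python
import random

# Toy sketch of S0 tuning (illustrative assumptions, not the paper's code):
# a frozen linear recurrence S_t = decay * S_{t-1} + k_t v_t^T with
# readout y_t = S_t q_t. All "weights" (decay, keys, values, queries)
# are frozen; only the initial state S0 is trained.

D = 3  # toy state dimension (assumption)
random.seed(0)

decay = 0.9
# Frozen per-step (key, value, query) triples and regression targets.
steps = [tuple([random.gauss(0, 1) for _ in range(D)] for _ in range(3))
         for _ in range(4)]
targets = [[random.gauss(0, 1) for _ in range(D)] for _ in steps]

def loss(S0):
    """Squared error of the rollout's outputs, as a function of S0 only."""
    S = [row[:] for row in S0]
    total = 0.0
    for (k, v, q), t in zip(steps, targets):
        S = [[decay * S[i][j] + k[i] * v[j] for j in range(D)]
             for i in range(D)]
        y = [sum(S[i][j] * q[j] for j in range(D)) for i in range(D)]
        total += sum((yi - ti) ** 2 for yi, ti in zip(y, t))
    return total

# Train S0 by finite-difference gradient descent; weights stay frozen.
S0 = [[0.0] * D for _ in range(D)]
lr, eps = 0.01, 1e-4
loss_before = loss(S0)
for _ in range(50):
    base = loss(S0)
    grad = [[0.0] * D for _ in range(D)]
    for i in range(D):
        for j in range(D):
            S0[i][j] += eps
            grad[i][j] = (loss(S0) - base) / eps
            S0[i][j] -= eps
    for i in range(D):
        for j in range(D):
            S0[i][j] -= lr * grad[i][j]
loss_after = loss(S0)
print(loss_before, loss_after)  # loss should decrease
```

Because the loss is quadratic in S0 for a linear recurrence, plain gradient descent on the initial state converges; the paper's setting replaces this toy rollout with a full hybrid language model and verified code solutions.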
Community
S0 Tuning optimizes only the initial recurrent state of hybrid models (GatedDeltaNet, Mamba-2) while freezing all weights. On Qwen3.5-4B: +23.6 pp HumanEval, +10.8 pp over LoRA, zero inference overhead. Works on FalconH1-7B as well. The core implementation is ~50 lines.
pip install s0-tuning | Code: github.com/JackYoung27/s0-tuning
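Because all model weights stay frozen, switching tasks amounts to swapping the saved per-layer initial states. A hedged sketch of that workflow (the class, method names, keys, and file format below are assumptions for illustration, not the s0-tuning library's actual API):

```python
import os
import pickle
import tempfile

# Hypothetical sketch of zero-merge task switching: the model weights
# never change; only the per-layer initial-state buffers are swapped.

class ToyHybridModel:
    """Stands in for a frozen hybrid model with settable initial states."""
    def __init__(self, n_recurrent_layers):
        self.init_states = {f"layer_{i}": 0.0
                            for i in range(n_recurrent_layers)}

    def load_s0(self, states):
        # No weight merging or model reload: just overwrite the buffers.
        self.init_states.update(states)

model = ToyHybridModel(n_recurrent_layers=2)

# Pretend these came from two separate S0-tuning runs (values made up).
s0_code = {"layer_0": 1.5, "layer_1": -0.3}
s0_math = {"layer_0": 0.2, "layer_1": 0.9}

# Persist one tuned state to disk and reload it, as with the ~48 MB file.
path = os.path.join(tempfile.mkdtemp(), "s0_code.pkl")
with open(path, "wb") as f:
    pickle.dump(s0_code, f)
with open(path, "rb") as f:
    model.load_s0(pickle.load(f))
assert model.init_states["layer_0"] == 1.5

model.load_s0(s0_math)  # instant task switch, no merge, no reload
assert model.init_states["layer_1"] == 0.9
```

The design point this illustrates: unlike LoRA, where deployment typically merges adapter weights into the base model, a tuned initial state is just a buffer that can be overwritten per request.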