A 350M-parameter GatedDeltaNet with low-rank parameterization (r512),
pretrained on FineWeb-Edu. It is part of a multi-cell ablation across
4 architectures × {r32, r64, r256, rfull} (plus extra GatedDeltaNet (GDN)
runs at r512). The ablation studies whether constraining the q/k/v
fast-weight projections (or LaCT's SwiGLU MLP) to low rank can match or
exceed full-rank performance at the 350M scale.
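
To make the rank labels concrete, the sketch below counts parameters for a single projection under each rank. It assumes a square `hidden x hidden` projection without bias, factored as two matrices of shapes `hidden x r` and `r x hidden`. Both the square shape and this factorization form are assumptions for illustration; the card does not state the exact projection shapes or how the low-rank constraint is implemented, and the helper names are hypothetical.

```python
# Hypothetical helpers: parameter counts for one projection, no bias terms.

def full_rank_params(d_in: int, d_out: int) -> int:
    """Parameters in a dense d_in x d_out weight matrix."""
    return d_in * d_out


def low_rank_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in a rank-r factorization: (d_in x r) followed by (r x d_out)."""
    return rank * (d_in + d_out)


HIDDEN = 1024  # hidden size from the configuration table

if __name__ == "__main__":
    dense = full_rank_params(HIDDEN, HIDDEN)
    for rank in (32, 64, 256, 512):
        factored = low_rank_params(HIDDEN, HIDDEN, rank)
        # Ratio of factored to dense parameter count for this projection.
        print(f"r{rank}: {factored:,} params ({factored / dense:.2%} of dense)")
```

Under these assumptions, a rank-512 factorization of a 1024 x 1024 projection has the same parameter count as the dense matrix, so any difference at r512 would come from the rank constraint rather than from a smaller parameter budget.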
| Setting | Value |
|---|---|
| Architecture | GatedDeltaNet |
| Rank | r512 |
| Params | ~350M (hidden=1024, layers=24, heads=16) |
| Dataset | HuggingFaceFW/fineweb-edu (streaming) |
| Steps | 10000 |
| Effective batch | 256 |
| Sequence length | 8000 |
| Optimizer | AdamW (lr=3e-4, eps=1e-15) |
| LR schedule | Cosine, 512-step warmup, decay to 10% |
| Precision | bf16 |
| Activation checkpointing | selective (option 1) |
| Tokens | ~20.5 B |
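
The token count and learning-rate schedule in the table can be checked with a short sketch. It assumes linear warmup, that the cosine decay spans the steps after warmup and ends at the final training step, and that "decay to 10%" means 10% of the peak learning rate. The function and constant names are illustrative and are not taken from the training code.

```python
import math

# Values from the configuration table.
PEAK_LR = 3e-4
WARMUP_STEPS = 512
TOTAL_STEPS = 10_000
EFFECTIVE_BATCH = 256
SEQ_LEN = 8_000
MIN_LR_RATIO = 0.10  # assumption: floor is 10% of the peak learning rate


def tokens_seen(steps: int = TOTAL_STEPS) -> int:
    """Tokens processed after `steps` optimizer steps."""
    return steps * EFFECTIVE_BATCH * SEQ_LEN


def lr_at(step: int) -> float:
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay."""
    if step < WARMUP_STEPS:
        # Assumed linear warmup that reaches the peak on the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return PEAK_LR * (MIN_LR_RATIO + (1.0 - MIN_LR_RATIO) * cosine)


if __name__ == "__main__":
    print(f"tokens: {tokens_seen() / 1e9:.2f} B")
    for step in (0, 511, 512, 5_000, 10_000):
        print(f"step {step:>6}: lr = {lr_at(step):.3e}")
```

The product 10000 × 256 × 8000 gives 20.48 B tokens, which agrees with the ~20.5 B figure in the table.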

12.47

rfull has worse FineWeb-Edu PPL than DeltaNet rfull (14.42 vs 12.55) but wins on
every lm-harness task (LAMBADA, HellaSwag, PIQA, ARC-E).

rfull is r256 (

Run name: `gated_deltanet_350M_r512_bs256_lr3e-4_steps10000`