GLA (Gated Linear Attention) 350M (full rank) — Low-rank Fast-Weight Ablation

Pretrained 350M-parameter GLA (Gated Linear Attention) model with the full-rank (rfull) parameterization, trained on FineWeb-Edu. It is one cell of a multi-cell ablation across 4 architectures × {r32, r64, r256, rfull} (plus extra r512 cells for GDN) that studies whether constraining the q/k/v fast-weight projections (or LaCT's SwiGLU MLP) to low rank can match or exceed full-rank performance at the 350M scale.
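To make the rank labels concrete, here is a minimal sketch of how a rank constraint changes the weight count of a single projection. It assumes the low-rank variant replaces a dense `d_in x d_out` matrix with a product of a `d_in x rank` and a `rank x d_out` factor, and that the projection is square at hidden size 1024. Both are illustrative assumptions; the exact factorization and projection shapes used in the sweep are not stated in this card, and the helper names are hypothetical.

```python
def full_rank_params(d_in: int, d_out: int) -> int:
    """Weights in a dense d_in x d_out projection (bias ignored)."""
    return d_in * d_out


def low_rank_params(d_in: int, d_out: int, rank: int) -> int:
    """Weights in a factorized projection: (d_in x rank) followed by (rank x d_out)."""
    return rank * (d_in + d_out)


HIDDEN = 1024  # hidden size of this model; the square projection shape is an assumption

dense = full_rank_params(HIDDEN, HIDDEN)
for name, rank in [("r32", 32), ("r64", 64), ("r256", 256), ("r512", 512)]:
    factored = low_rank_params(HIDDEN, HIDDEN, rank)
    print(f"{name}: {factored:,} weights, {factored / dense:.1%} of dense")
```

Under these assumptions the factorization only saves weights while `rank < d_in * d_out / (d_in + d_out)`, which is 512 for a square 1024 projection.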

Training


Architecture	GLA (Gated Linear Attention)
Rank	`rfull`
Params	~350M (hidden=1024, layers=24, heads=16)
Dataset	`HuggingFaceFW/fineweb-edu` (streaming)
Steps	10000
Effective batch	256
Sequence length	8000
Optimizer	AdamW (lr=3e-4, eps=1e-15)
LR schedule	Cosine, 512-step warmup, decay to 10%
Precision	bf16
Activation checkpointing	selective (option 1)
Tokens	~20.5 B
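The token budget and learning-rate schedule in the table can be checked with a short sketch. It assumes the effective batch is counted in sequences, that warmup is linear, and that "decay to 10%" means the cosine curve bottoms out at 10% of the peak learning rate on the final step. The function names are hypothetical and this is not the training code itself.

```python
import math

STEPS = 10_000
EFFECTIVE_BATCH = 256   # assumed to be sequences per optimizer step
SEQ_LEN = 8_000
PEAK_LR = 3e-4
WARMUP_STEPS = 512
MIN_LR_RATIO = 0.10     # assumed floor: 10% of the peak learning rate


def total_tokens(steps: int = STEPS, batch: int = EFFECTIVE_BATCH, seq_len: int = SEQ_LEN) -> int:
    """Tokens seen over training: steps x sequences per step x tokens per sequence."""
    return steps * batch * seq_len


def lr_at(step: int) -> float:
    """Learning rate at a step: linear warmup (assumed), then cosine decay to the floor."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / max(1, STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return PEAK_LR * (MIN_LR_RATIO + (1.0 - MIN_LR_RATIO) * cosine)


print(f"total tokens: {total_tokens() / 1e9:.2f} B")
for step in (0, WARMUP_STEPS, STEPS // 2, STEPS):
    print(f"step {step:>5}: lr = {lr_at(step):.2e}")
```

With these settings the product of steps, batch, and sequence length is 20.48 B tokens, which matches the ~20.5 B listed above.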

Eval results

FineWeb-Edu val PPL: 14.42
LAMBADA acc: 0.252
HellaSwag acc_norm: 0.356
ARC-Easy acc_norm: 0.464
ARC-Challenge acc_norm: 0.271
PIQA acc_norm: 0.630
WinoGrande acc: 0.503
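For readers who think in loss rather than perplexity, the validation PPL can be converted to an average per-token cross-entropy. This sketch assumes the reported value follows the standard definition, PPL = exp(mean negative log-likelihood in nats per token); the tokenizer and evaluation split are not specified here, so the numbers are only comparable across runs that share both.

```python
import math


def ppl_to_nats(ppl: float) -> float:
    """Mean per-token cross-entropy in nats, assuming PPL = exp(mean NLL)."""
    return math.log(ppl)


def ppl_to_bits(ppl: float) -> float:
    """Mean per-token cross-entropy in bits."""
    return math.log2(ppl)


VAL_PPL = 14.42  # FineWeb-Edu validation perplexity reported above

print(f"loss: {ppl_to_nats(VAL_PPL):.3f} nats/token")
print(f"loss: {ppl_to_bits(VAL_PPL):.3f} bits/token")
```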

Notes on the 350M sweep

Downstream eval discrimination comes online at 350M. At 100M, HellaSwag / LAMBADA were near-chance for most cells; at 350M they discriminate clearly between archs/ranks.
PPL does not linearly predict downstream accuracy. At a matched ~374M parameters, GLA rfull has worse FineWeb-Edu PPL than DeltaNet rfull (14.42 vs 12.55) but wins on every lm-harness task (LAMBADA, HellaSwag, PIQA, ARC-E).
GatedDeltaNet dominates at the cost of size. GDN rfull is ~526M (head_dim=256 inflates q/k/v) and wins every metric; GDN r256 (~432M) is the matched-param comparison and still leads.
LaCT is rank-robust at 350M. PPL/LAMBADA stay flat across r64 / r256 / rfull — the cleanest evidence for the "low rank as regularization" hypothesis.
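The "near-chance" remark in the first note can be made concrete by subtracting a random-guess baseline from each multiple-choice score of this model. The baselines below are assumptions based on the usual number of answer options per task (four for HellaSwag and ARC, two for PIQA and WinoGrande); LAMBADA is left out because it has no comparable baseline. The names are hypothetical.

```python
# Scores reported in this card for GLA rfull.
SCORES = {
    "HellaSwag": 0.356,
    "ARC-Easy": 0.464,
    "ARC-Challenge": 0.271,
    "PIQA": 0.630,
    "WinoGrande": 0.503,
}

# Assumed random-guess accuracy per task (1 / typical number of choices).
CHANCE = {
    "HellaSwag": 0.25,
    "ARC-Easy": 0.25,
    "ARC-Challenge": 0.25,
    "PIQA": 0.50,
    "WinoGrande": 0.50,
}


def margin_over_chance(task: str) -> float:
    """Reported accuracy minus the assumed random-guess baseline."""
    return SCORES[task] - CHANCE[task]


for task in sorted(SCORES, key=margin_over_chance, reverse=True):
    print(f"{task:<14} {margin_over_chance(task):+.3f}")
```

Under these assumed baselines, WinoGrande and ARC-Challenge sit closest to chance for this cell, so they are the least likely to separate architectures or ranks at this scale.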

Run name: gla_350M_rfull_bs256_lr3e-4_steps10000
