Matilda-Mini

A sub-200M-parameter language model trained from scratch and, more to the point, the training infrastructure around it: distributed-ready training loop, crash-safe checkpoint/resume, fault tolerance, observability, and a verifiable data pipeline. Built from first principles in PyTorch, no training frameworks.

Not a fine-tune. Not a wrapper. Random init → a working LM, trained by code in this repo. The model is standard-modern; the systems work is the point.

Why this exists

This is a portfolio project for an LLM training-infrastructure role. The interesting problems in training large models aren't in the architecture, which is well understood; they're in the systems: making multi-day runs reliable, resumable, observable, and fast on the hardware you have. So this repo is deliberately weighted toward operational excellence over architectural novelty.

Architecture (src/matilda/model.py)

A modern dense decoder-only transformer, the same recipe as Llama/Qwen-class models, at GPT-2-small scale (nominally 124M; 114M parameters as built, see Size below):

| Component | Choice |
|---|---|
| Positions | RoPE (rotary, half-rotation convention) |
| Normalization | RMSNorm, pre-norm, fp32 reduction |
| MLP | SwiGLU (2/3 sizing) |
| Attention | GQA (12 query / 4 KV heads) + QK-Norm |
| Stability | residual-projection init scaled 1/√(2·n_layers); optional attn logit soft-cap |
| Tying | embedding ↔ LM head |
| Size | 114M total / 75.5M non-embedding · d=768 · 12 layers · seq 1024 · GPT-2 50k vocab |
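
The Size row can be reproduced by hand. The sketch below is a back-of-envelope count, not code from this repo: it assumes bias-free linear layers, a head dimension of 768/12 = 64, a SwiGLU hidden width of (2/3)·4·d = 2048, and the 50,257-token GPT-2 vocabulary, and it ignores the small RMSNorm gain vectors. The function and argument names are illustrative.

```python
def count_params(d_model=768, n_layers=12, n_q_heads=12, n_kv_heads=4,
                 vocab_size=50257):
    """Rough parameter count for a GQA + SwiGLU decoder (norm gains ignored)."""
    head_dim = d_model // n_q_heads                  # 64
    kv_dim = n_kv_heads * head_dim                   # 256: K and V are narrower under GQA
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim   # Wq and Wo, plus Wk and Wv
    hidden = 4 * d_model * 2 // 3                    # SwiGLU 2/3 sizing -> 2048
    mlp = 3 * d_model * hidden                       # gate, up, and down projections
    non_embedding = n_layers * (attn + mlp)
    embedding = vocab_size * d_model                 # counted once: tied with the LM head
    return non_embedding, non_embedding + embedding


non_emb, total = count_params()
print(f"{non_emb / 1e6:.1f}M non-embedding, {total / 1e6:.1f}M total")
# prints: 75.5M non-embedding, 114.1M total
```

Under these assumptions the count lands on the table's figures, with the tied embedding matrix accounting for the gap between the two numbers.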

Training infrastructure (the actual deliverable)

| Capability | Where | What it does |
|---|---|---|
| Bit-for-bit resume | checkpoint.py | atomic writes; saves model+opt+sched+step+RNG+dataloader position; a killed run resumes to a loss curve identical to the uninterrupted one (< 1e-6, tested) |
| Fault tolerance | train.py | NaN/Inf guard (skip+log+abort-after-N); SIGTERM → checkpoint-and-exit for spot-instance death |
| Observability | monitor.py | MFU (incl. attention FLOPs), tokens/s, rolling step-time (catches throttling), grad-norm, peak GPU mem → always-on metrics.jsonl + optional W&B |
| Throughput | train.py | bf16 autocast, Flash-SDPA, torch.compile, fused AdamW, TF32, pinned/non-blocking H2D, grad-accum with DDP no_sync |
| Data pipeline | data.py, scripts/prepare_data.py | streams FineWeb-Edu → tokenizes → uint16 shards with SHA-256 manifest; mmap'd, resumable BinStream |
| Optimizers | optim.py | AdamW (correct param-group decay) + Muon (Newton-Schulz orthogonalization, hybrid with AdamW) |
| Reproducibility | train.py | full config + git SHA logged per run; deterministic seeding |
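
To make the first row concrete, here is a minimal standard-library sketch of the two ideas behind it: an atomic write (temp file, fsync, rename) and saving RNG state alongside the step counter. It is not the contents of checkpoint.py. The toy loop draws random numbers as a stand-in for losses, and `save_checkpoint`, `load_checkpoint`, and `run` are hypothetical names.

```python
import os
import pickle
import random
import tempfile


def save_checkpoint(path, state):
    """Write atomically: a crash leaves the old file or the new one, never a partial."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())       # force bytes to disk before the rename
        os.replace(tmp_path, path)     # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)


def run(total_steps, ckpt_path, start=None, save_at=None):
    """Toy loop: each step draws one random number as a stand-in for a loss."""
    step, losses = 0, []
    if start is not None:              # resume: restore step counter and RNG state
        step = start["step"]
        random.setstate(start["rng"])
    else:
        random.seed(0)
    while step < total_steps:
        losses.append(random.random())
        step += 1
        if step == save_at:
            save_checkpoint(ckpt_path, {"step": step, "rng": random.getstate()})
    return losses


with tempfile.TemporaryDirectory() as d:
    ckpt = os.path.join(d, "ckpt.pkl")
    full = run(10, ckpt, save_at=4)    # uninterrupted run, checkpoint after step 4
    random.seed(123)                   # scramble the RNG to mimic a fresh process
    resumed = run(10, ckpt, start=load_checkpoint(ckpt))
    assert resumed == full[4:]         # the resumed curve matches exactly
```

In a PyTorch loop the saved state would also carry the model, optimizer, and scheduler state dicts, the torch RNG states (`torch.get_rng_state()`, `torch.cuda.get_rng_state_all()`), and the dataloader position, as the table lists.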

Results

Validated (RTX 3090): 30/30 tests pass on GPU, smoke test and bit-for-bit resume both clean, 53.4% MFU at batch_size=24 with torch.compile (BS≥28 OOMs on the vocab projection, the expected memory hotspot).
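
For readers unfamiliar with MFU, the sketch below shows one common way to estimate it, using the 6N + 12·L·d·T FLOPs-per-token convention popularized by the PaLM paper (the second term is the attention FLOPs). It is an assumption that monitor.py follows this exact convention. `peak_flops` is whatever dense peak you choose for your hardware and precision, and the example inputs are toy values, not measurements from this repo.

```python
def mfu(n_params, n_layers, d_model, seq_len, tokens_per_sec, peak_flops):
    """Model FLOPs utilization: achieved training FLOPs/s over hardware peak FLOPs/s."""
    dense = 6 * n_params                             # forward + backward matmul FLOPs per token
    attention = 12 * n_layers * d_model * seq_len    # QK^T and attention-times-V, forward + backward
    flops_per_token = dense + attention
    return flops_per_token * tokens_per_sec / peak_flops


# Toy inputs chosen only to exercise the formula.
example = mfu(n_params=1_000_000, n_layers=2, d_model=64, seq_len=128,
              tokens_per_sec=1_000, peak_flops=1e12)
print(f"MFU = {example:.4%}")
```

This convention does not discount the attention term for causal masking, so it slightly overstates the useful attention work. What matters most is using the same convention across runs being compared.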

Training run + ablations: pending the A100 run. The ablation harness (scripts/ablate.py) emits docs/ABLATIONS.md, a controlled comparison with one change per row:

| Variant | What it isolates |
|---|---|
| baseline | full modern stack |
| no_qk_norm | QK-Norm's stability contribution |
| mha / mqa | GQA ratio vs full multi-head / multi-query |
| muon | Muon vs AdamW convergence |
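
To illustrate what the mha / mqa row varies, the sketch below maps each of the 12 query heads to the KV head it reads from. It assumes contiguous grouping (the usual Llama-style layout) and that the mha and mqa variants use 12 and 1 KV heads respectively, which follows from the definitions rather than from scripts/ablate.py. `kv_head_for_query` is a hypothetical helper.

```python
def kv_head_for_query(n_q_heads, n_kv_heads):
    """Map each query head to the KV head it shares (contiguous groups)."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("query heads must be a multiple of KV heads")
    group = n_q_heads // n_kv_heads
    return [q // group for q in range(n_q_heads)]


variants = {"baseline (GQA)": 4, "mha": 12, "mqa": 1}
for name, n_kv in variants.items():
    print(f"{name:15s} {kv_head_for_query(12, n_kv)}")
# baseline (GQA)  [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
# mha             [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
# mqa             [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Fewer KV heads shrink the K/V projections and the KV cache, so the mha and mqa variants also change the parameter count, not only the sharing pattern.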

Target (124M, ~3B tokens, vs Pythia-160M): HellaSwag ~30-35%, ARC-easy ~40-45%, PIQA ~60%.

Quickstart

pip install -r requirements.txt          # GPU: install torch from cu124 first (see runbook)
pytest tests/ -q                          # 30 tests: correctness, resume, NaN guard, data integrity

# train (synthetic dry run, no data needed)
python run.py --config configs/calibration.json --dry-run \
  --set model.d_model=128 model.n_layers=2 train.total_steps=20 train.device=cpu train.compile=false

# real run (after tokenizing data; see docs/INSTANCE_RUNBOOK.md)
python scripts/prepare_data.py --out-dir data/fwedu --target-tokens 3000000000
python run.py --config configs/base_124m.json --data-dir data/fwedu

Full GPU procedure (validate → calibrate → ablate → train → eval) is in docs/INSTANCE_RUNBOOK.md.
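
Before a long run it can be worth re-checking the tokenized shards against their SHA-256 manifest. The sketch below shows the general idea with the standard library only. The manifest layout (a JSON map from shard filename to hex digest), the `.bin` suffix, and the function names are all assumptions for illustration, not the format prepare_data.py writes.

```python
import hashlib
import json
import os
import tempfile


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large shards never need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def write_manifest(shard_dir, manifest_name="manifest.json"):
    shards = sorted(n for n in os.listdir(shard_dir) if n.endswith(".bin"))
    manifest = {n: sha256_file(os.path.join(shard_dir, n)) for n in shards}
    with open(os.path.join(shard_dir, manifest_name), "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest


def verify_manifest(shard_dir, manifest_name="manifest.json"):
    """Return the names of shards whose bytes no longer match the manifest."""
    with open(os.path.join(shard_dir, manifest_name)) as f:
        manifest = json.load(f)
    return [name for name, digest in manifest.items()
            if sha256_file(os.path.join(shard_dir, name)) != digest]


with tempfile.TemporaryDirectory() as d:
    for i in range(2):                               # two small fake shards
        with open(os.path.join(d, f"shard_{i:04d}.bin"), "wb") as f:
            f.write(bytes(range(256)) * 4)
    write_manifest(d)
    clean = verify_manifest(d)                       # nothing flagged yet
    with open(os.path.join(d, "shard_0001.bin"), "r+b") as f:
        f.seek(10)
        f.write(b"\xff")                             # flip one byte in place
    corrupted = verify_manifest(d)                   # the damaged shard is flagged
```

A missing shard raises `FileNotFoundError` in this sketch; a production check would likely report that case separately from a digest mismatch.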

Repository layout

src/matilda/    config, model, optim, checkpoint, monitor, data, train
scripts/        prepare_data.py (tokenize), ablate.py (experiments), launch_vast.sh
configs/        calibration.json (MFU tuning), base_124m.json (the run)
tests/          30 tests: model, checkpoint, train loop, data, optim, ablation, run
docs/           INSTANCE_RUNBOOK.md (operating manual)
run.py          training entrypoint (--config + --set overrides)

Testing

30 tests run on CPU in ~2 min. Highlights: overfit-single-batch (the model can learn), causal-mask-no-leak (no future-token leakage), bit-for-bit resume, NaN-skip-then-recover, shard checksum corruption detection, Muon overfit.

pytest tests/ -q
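
The NaN-skip-then-recover behavior can be sketched without PyTorch. The guard below skips the update on a non-finite loss, logs it, and aborts after several bad steps in a row. Reading "abort-after-N" as N consecutive failures is an assumption, and `NonFiniteGuard` is a hypothetical name, not the class in train.py.

```python
import math


class NonFiniteGuard:
    """Skip steps with a NaN/Inf loss; abort after too many bad steps in a row."""

    def __init__(self, max_consecutive=3):
        self.max_consecutive = max_consecutive
        self.consecutive = 0
        self.skipped = 0

    def should_step(self, loss, step):
        if math.isfinite(loss):
            self.consecutive = 0                     # recovered: reset the streak
            return True
        self.consecutive += 1
        self.skipped += 1
        print(f"step {step}: non-finite loss ({loss}), skipping update")
        if self.consecutive >= self.max_consecutive:
            raise RuntimeError(
                f"{self.consecutive} consecutive non-finite losses, aborting")
        return False


guard = NonFiniteGuard(max_consecutive=3)
losses = [2.0, float("nan"), 1.9, float("inf"), float("nan"), 1.8]
applied = [step for step, loss in enumerate(losses)
           if guard.should_step(loss, step)]
print(applied)        # [0, 2, 5]
print(guard.skipped)  # 3
```

In a real training loop the same check would sit between `loss.backward()` and `optimizer.step()`, typically testing the gradient norm as well (for example with `torch.isfinite`), and the skipped step would zero the gradients instead of applying them.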