bitnet1 — 300M v76 1-bit LLM research backup

Backup of the active research VM for the 1-bit LLM ("bitnet1") project. This is a research dump, not a polished release: it preserves the in-flight state of training scripts, checkpoints, profile logs, and analysis notes.

What's here

| Path | Contents |
| --- | --- |
| code/ | The training and inference stack from the VM's /root/bitnet1/code/ (model_v47.py, model_v47b.py, model_v16.py, train_300m_flex.py, analyze_pointer_health.py, etc.) |
| local-source/ | The author's local /home/nathan/1bitllm/ repo: synth experiments, benchmarks, ablation scripts, analysis notebooks, and writeups |
| *.pt, *.bit2, *_pointer_health.json, *_induction.json (root) | Best-of-run checkpoints plus per-checkpoint probe-metric sidecars — see § Checkpoints below |
| logs/ | Training run logs from this session (and recent prior ones): JSONL events per run, plus stdout |
| notes/ | Session writeups (math derivations, bonsai/ALiBi findings, truebit negative result) |

Checkpoints (at repo root)

| File | Size | Notes |
| --- | --- | --- |
| v76_300M_fresh_rwsp_kshift_L0_best.pt | 1.26 GB | Deployed cold-start v76: val_bpc 5.674 at step 39K, top L10H0 IPMR 10.21%, 15 specialist heads ≥5%. The headline 300M result. |
| v76_300M_bonsai_extend_best.pt | 1.26 GB | Bonsai-trained v76: val_bpc 6.160 at step 15K (= 2K smoke + 13K extend), top L0H0 IPMR 6.62%, 5 specialist heads ≥5%. Canonical Olsson L0 induction emerges with the bonsai recipe. |
| v76_300M_glm_ft_best.pt | 1.26 GB | GLM-5.1-Reasoning finetune of cold-start v76 (1500 steps, unfrozen): val_bpc 6.279 on GLM, FineWeb-probe top IPMR 3.74% (catastrophic forgetting demonstrated). |
| fineweb_edu_v75_300M_flex_best.pt | 1.26 GB | 300M v75 baseline (no RWSP, no K-shift): step 67K, val_bpc 6.015. Reference for v76 ablations. |
| fineweb_edu_v75_75M_resumed_best.pt | 6.4 GB | 75M v75 baseline (with optimizer state): step 228K, val_bpc 6.16. Long-trained reference. |
| v57*_300M.bit2, v75_resumed.bit2 | 45 MB each | Older bit-packed inference exports (deployed in chat_app/text_app). |

JSON sidecars (*_pointer_health.json) preserve the exact IPMR / HDR / M4 metrics and per-(layer, head) IPMR matrix so all paper claims are verifiable from these files alone.
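As a sketch of how such a sidecar could be consumed, assuming a nested-list ipmr_matrix field holding per-(layer, head) IPMR percentages (the actual key names in the *_pointer_health.json files are not confirmed here):

```python
# Hypothetical sidecar schema (key names are assumptions): the
# per-(layer, head) IPMR matrix stored as nested lists, in percent.
sidecar = {
    "ipmr_matrix": [
        [6.62, 1.10, 0.40],  # layer 0: heads 0..2
        [0.30, 5.40, 0.20],  # layer 1: heads 0..2
    ],
}

def top_ipmr(matrix):
    """Return (layer, head, value) of the highest-IPMR head."""
    return max(
        ((l, h, v) for l, row in enumerate(matrix) for h, v in enumerate(row)),
        key=lambda t: t[2],
    )

def specialist_heads(matrix, threshold=5.0):
    """Count heads at or above the specialist threshold (percent IPMR)."""
    return sum(v >= threshold for row in matrix for v in row)

layer, head, value = top_ipmr(sidecar["ipmr_matrix"])
print(f"top L{layer}H{head} IPMR {value:.2f}%, "
      f"{specialist_heads(sidecar['ipmr_matrix'])} heads >= 5%")
```

The same two reductions (arg-max over the matrix, threshold count) reproduce the "top LxHy IPMR" and "N specialist heads ≥5%" figures quoted in the checkpoint table.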

Recipe summary

The deployed cold-start v76 is the v75 1-bit BitNet stack (model_v47b.BitLMv47B, sign-STE quantization with per-channel α scales, RMSNorm) plus two training-time interventions:

  1. RWSP-G attention — rank-weighted soft pointer with a geometric anneal over the top-K survivors. Provides the gradient-flow breadth that hard Gumbel-argmax lacks (use_rwsp=True, rwsp_k=4).
  2. K-shift L0 — at attention layer 0, K[t] = k_proj(x[t] + x[t-1]) instead of k_proj(x[t]). Zero new parameters; the Olsson 2022 prev-token signal is injected as a data-routing change (use_kshift=True, kshift_layers=[0]).
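A minimal numpy sketch of both interventions, reconstructed from the one-line descriptions above (rwsp_weights is an illustrative guess at the RWSP-G weighting, not the code in model_v47b.py):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 8                       # toy sequence length and model dim
x = rng.standard_normal((T, d))   # layer-0 input activations
W_k = rng.standard_normal((d, d)) # key projection

# Standard keys: K[t] = k_proj(x[t]).
K_std = x @ W_k

# K-shift L0: K[t] = k_proj(x[t] + x[t-1]), with x[-1] taken as zero.
# Zero new parameters: the same W_k is reused, only the input routing changes.
x_prev = np.vstack([np.zeros((1, d)), x[:-1]])
K_shift = (x + x_prev) @ W_k

# Illustrative guess at RWSP-G pointer weights: keep the top-k attention
# scores and weight survivors geometrically by rank; annealing gamma -> 0
# recovers a hard argmax pointer.
def rwsp_weights(scores, k=4, gamma=0.5):
    order = np.argsort(scores)[::-1][:k]  # top-k survivor indices, best first
    w = gamma ** np.arange(k, dtype=float)
    w /= w.sum()                          # normalized rank weights
    out = np.zeros_like(scores)
    out[order] = w
    return out
```

Note that the first K-shift key equals the standard one (there is no previous token at t = 0), so the change only affects positions t ≥ 1.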

Bonsai variant additionally anneals the weight quantization mix from 0 (fp32) to 1 (sign-STE) over the first 60% of training (--anneal-binarize --binarize-anneal-end-frac 0.6). See notes/session_2026-05-07_bonsai_alibi_finetune.md for the full result.
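A sketch of what that anneal could look like; the exact mixing rule inside train_300m_flex.py is an assumption here:

```python
import numpy as np

def anneal_mix(step, total_steps, end_frac=0.6):
    """Quantization mix lambda in [0, 1]: ramps linearly from 0 (pure fp32)
    to 1 (pure sign-STE) over the first end_frac of training."""
    return min(1.0, step / (end_frac * total_steps))

def mixed_weight(w, lam):
    """Blend latent fp32 weights with their sign-quantized version,
    using a per-channel alpha scale as in the sign-STE quantizer."""
    alpha = np.abs(w).mean(axis=-1, keepdims=True)  # per-output-channel scale
    return (1.0 - lam) * w + lam * alpha * np.sign(w)
```

At step 0 the forward pass sees unmodified fp32 weights; past 60% of training it sees only the quantized weights (with STE gradients still flowing to the latent fp32 copy).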

Key findings (in notes/)

- session_2026-05-07_bonsai_alibi_finetune.md — three landmark findings in one writeup: (1) ALiBi was blocking all prior synth induction probes (--no-alibi recovers them); (2) bonsai works for 1-bit once ALiBi is removed (reverses the prior memo); (3) finetuning destroys 1-bit induction circuits regardless of freeze pattern (more freezing → worse).
- truebit_no_latent_negative.md — negative result on training without fp32 latent weights. The flip-counter rule loses too much gradient signal; the right path to <10% float compute is custom 1-bit kernels, not removing the fp32 latents.
- v76_math_derivation.md — Q-shift / K-shift bilinear depth analysis (the math that motivated K-shift L0; also contains the refuted Q-shift collapse argument).

Reproducing claims

Each training run's eval and probe metrics are in two places:

- logs/v76_300M_*.out — JSONL events with {event, step, val_bpc, ...}
- ckpt/*_pointer_health.json — the full IPMR matrix at the saved best.pt step
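A short sketch of scanning such a log for the best eval, assuming eval events carry the {event, step, val_bpc} fields quoted above (the event names themselves are assumptions):

```python
import io
import json

# Stand-in for open("logs/v76_300M_....out"): three hypothetical JSONL events.
log = io.StringIO(
    '{"event": "eval", "step": 1000, "val_bpc": 6.91}\n'
    '{"event": "eval", "step": 2000, "val_bpc": 6.40}\n'
    '{"event": "ckpt", "step": 2000}\n'
)

# Keep only events that report val_bpc, then take the minimum.
best = min(
    (json.loads(line) for line in log if '"val_bpc"' in line),
    key=lambda e: e["val_bpc"],
)
print(best["step"], best["val_bpc"])
```

The step of the minimum-val_bpc event should match the step recorded in the corresponding *_best.pt sidecar.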

To re-probe a checkpoint:

```bash
# With the codebase's dependencies installed; the probe runs on CPU as
# configured here, so no GPU is required for this check
python code/analyze_pointer_health.py \
    --ckpt v76_300M_fresh_rwsp_kshift_L0_best.pt \
    --device cpu --n-batches 4 --batch-size 4 --seq-len 512
```

License

MIT for code. Checkpoints are derivative of FineWeb-Edu (ODC-By 1.0) and, for the GLM-finetuned checkpoint specifically, the GLM-5.1-Reasoning-1M-Cleaned dataset; see those datasets' licenses for downstream restrictions.

Active work (as of upload time)

Pivot in progress to "entirely 1-bit training", defined as <10% of wall-clock spent on float ops. The initial truebit (no-fp32-latent) prototype showed the flip-counter rule loses too much gradient signal at synth scale. Profiling 300M v76 training: GEMMs are ~44% of CUDA time, RWSP topk 19%, ALiBi/causal mask 13%. RTX 5090 int8 GEMM is slower than bf16 at our shapes (dispatch-bound), so the path to <10% requires custom XNOR/popcount kernels, not int8 substitution.
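For context, the XNOR/popcount identity those kernels would exploit, shown as a pure-Python sketch: for sign vectors a, b in {-1, +1}^n packed as bit masks, the dot product reduces to one XOR and one popcount per machine word instead of n multiply-adds.

```python
def pack_signs(signs):
    """Pack a {-1, +1} vector into an int bit mask (bit i set <=> +1)."""
    bits = 0
    for i, s in enumerate(signs):
        if s > 0:
            bits |= 1 << i
    return bits

def bit_dot(a_bits, b_bits, n):
    """Dot product of two packed sign vectors of length n:
    n - 2 * popcount(a XOR b), since each differing bit contributes -1
    and each matching bit contributes +1."""
    return n - 2 * bin(a_bits ^ b_bits).count("1")

a = [+1, -1, +1, +1]
b = [+1, +1, -1, +1]
assert bit_dot(pack_signs(a), pack_signs(b), len(a)) == sum(
    x * y for x, y in zip(a, b)
)
```

(XNOR with popcount-of-matches and XOR with popcount-of-mismatches are equivalent formulations; XOR avoids masking the unused high bits.)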
