# bitnet1: 300M v76 1-bit LLM research backup
Backup of the active research VM for the 1-bit LLM ("bitnet1") project. This is a research dump, not a polished release: it preserves the in-flight state of training scripts, checkpoints, profile logs, and analysis notes.
## What's here
| Path | Contents |
|---|---|
| `code/` | The training and inference stack from the VM at `/root/bitnet1/code/` (`model_v47.py`, `model_v47b.py`, `model_v16.py`, `train_300m_flex.py`, `analyze_pointer_health.py`, etc.) |
| `local-source/` | The author's local `/home/nathan/1bitllm/` repo: synth experiments, benchmarks, ablation scripts, analysis notebooks, and writeups |
| `*.pt`, `*.bit2`, `*_pointer_health.json`, `*_induction.json` (root) | Best-of-run checkpoints plus per-checkpoint probe-metric sidecars; see § Checkpoints below |
| `logs/` | All training-run logs from this session (and recent prior ones): JSONL events per training run, plus stdout |
| `notes/` | Session writeups (math derivations, bonsai/ALiBi findings, truebit negative result) |
## Checkpoints (at repo root)
| File | Size | Notes |
|---|---|---|
| `v76_300M_fresh_rwsp_kshift_L0_best.pt` | 1.26 GB | Deployed cold-start v76: val_bpc 5.674 at step 39K, top L10H0 IPMR 10.21%, 15 specialist heads ≥5%. The headline 300M result. |
| `v76_300M_bonsai_extend_best.pt` | 1.26 GB | Bonsai-trained v76: val_bpc 6.160 at step 15K (= 2K smoke + 13K extend), top L0H0 IPMR 6.62%, 5 specialist heads ≥5%. Canonical Olsson L0 induction emerges with the bonsai recipe. |
| `v76_300M_glm_ft_best.pt` | 1.26 GB | GLM-5.1-Reasoning finetune of the cold-start v76 (1500 steps, unfrozen): val_bpc 6.279 on GLM, FineWeb-probe top IPMR 3.74% (catastrophic forgetting demonstrated). |
| `fineweb_edu_v75_300M_flex_best.pt` | 1.26 GB | 300M v75 baseline (no RWSP, no K-shift): step 67K, val_bpc 6.015. Reference for the v76 ablations. |
| `fineweb_edu_v75_75M_resumed_best.pt` | 6.4 GB | 75M v75 baseline (with optimizer state): step 228K, val_bpc 6.16. Long-trained reference. |
| `v57*_300M.bit2`, `v75_resumed.bit2` | 45 MB each | Older bit-packed inference exports (deployed in `chat_app`/`text_app`). |
The JSON sidecars (`*_pointer_health.json`) preserve the exact IPMR / HDR / M4 metrics and the per-(layer, head) IPMR matrix, so all paper claims are verifiable from these files alone.
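As a concrete example, a few lines of Python can recover the headline head-level numbers from a sidecar alone. This is a hedged sketch: the filename follows the `*_pointer_health.json` pattern above, but the `ipmr_matrix` field name and the fraction-valued storage are assumptions; check one sidecar for the real schema.

```python
import json
import numpy as np

# Assumed sidecar name (pattern from above) and assumed "ipmr_matrix" field.
side = json.load(open("v76_300M_fresh_rwsp_kshift_L0_best_pointer_health.json"))
ipmr = np.array(side["ipmr_matrix"])              # assumed shape: (layers, heads)
layer, head = np.unravel_index(ipmr.argmax(), ipmr.shape)
print(f"top L{layer}H{head} IPMR {ipmr[layer, head]:.2%}")  # e.g. "top L10H0 IPMR 10.21%"
print("specialist heads >= 5%:", int((ipmr >= 0.05).sum()))
```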
## Recipe summary
The deployed cold-start v76 is the v75 1-bit BitNet stack (`model_v47b.BitLMv47B`, sign-STE quantization with per-channel α scales, RMSNorm) plus two training-time interventions:
- RWSP-G attention: rank-weighted soft pointer with a geometric anneal over the top-K survivors. Provides the gradient-flow breadth that hard gumbel-argmax lacks (`use_rwsp=True, rwsp_k=4`).
- K-shift L0: at attention layer 0, `K[t] = k_proj(x[t] + x[t-1])` instead of just `x[t]`. Zero new parameters; the Olsson 2022 prev-token signal is injected as a data-routing change (`use_kshift=True, kshift_layers=[0]`). Hedged sketches of both interventions follow this list.
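A hedged PyTorch sketch of both interventions, reconstructed purely from the descriptions above. The real implementations live in `code/model_v47b.py`; the function names and the exact form of the rank weighting here are my assumptions.

```python
import torch
import torch.nn.functional as F

def rwsp_g(scores: torch.Tensor, k: int = 4, gamma: float = 0.5) -> torch.Tensor:
    """Rank-weighted soft pointer over the top-k survivors (assumed form).

    Softmax the k surviving scores, then damp each by a geometric rank
    weight gamma**rank. Annealing gamma -> 0 collapses all mass onto the
    rank-0 survivor (hard argmax); gamma near 1 keeps gradient flowing
    through all k survivors, the breadth hard gumbel-argmax lacks.
    """
    vals, idx = scores.topk(k, dim=-1)
    rank_w = gamma ** torch.arange(k, device=scores.device, dtype=scores.dtype)
    w = torch.softmax(vals, dim=-1) * rank_w
    w = w / w.sum(dim=-1, keepdim=True)
    return torch.zeros_like(scores).scatter(-1, idx, w)

def kshift_keys(x: torch.Tensor, k_proj: torch.nn.Module) -> torch.Tensor:
    """K-shift L0: K[t] = k_proj(x[t] + x[t-1]) with zero new parameters.

    x: (batch, seq, d_model). Position 0 has no predecessor, so the
    shifted-in value is zero-padded.
    """
    x_prev = F.pad(x[:, :-1, :], (0, 0, 1, 0))   # x_prev[:, t] = x[:, t-1]
    return k_proj(x + x_prev)
```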
The bonsai variant additionally anneals the weight-quantization mix from 0 (fp32) to 1 (sign-STE) over the first 60% of training (`--anneal-binarize --binarize-anneal-end-frac 0.6`). See `notes/session_2026-05-07_bonsai_alibi_finetune.md` for the full result.
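A minimal sketch of what that anneal computes, assuming the mix is a linear blend between the fp32 latent weight and its sign-STE binarization. The flag names above are real; the blend form and the per-output-channel α convention here are assumptions.

```python
import torch

def binarize_ste(w: torch.Tensor) -> torch.Tensor:
    # sign-STE with a per-output-channel alpha scale (assumed convention
    # for a Linear weight of shape (out_features, in_features))
    alpha = w.abs().mean(dim=1, keepdim=True)
    w_bin = alpha * torch.sign(w)
    return w + (w_bin - w).detach()   # forward: w_bin; backward: identity

def bonsai_weight(w: torch.Tensor, step: int, total_steps: int,
                  anneal_end_frac: float = 0.6) -> torch.Tensor:
    # mix ramps 0 -> 1 over the first anneal_end_frac of training,
    # then stays at 1 (pure sign-STE)
    mix = min(1.0, step / (anneal_end_frac * total_steps))
    return (1.0 - mix) * w + mix * binarize_ste(w)
```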
## Key findings (in `notes/`)
- `session_2026-05-07_bonsai_alibi_finetune.md`: three landmark findings in one writeup: (1) ALiBi was blocking all prior synth induction probes (`--no-alibi` recovers it); (2) bonsai works for 1-bit once ALiBi is removed (reverses the prior memo); (3) finetuning destroys 1-bit induction circuits regardless of freeze pattern (more freezing → worse).
- `truebit_no_latent_negative.md`: negative result on training without fp32 latent weights. The flip-counter rule loses too much gradient signal; the right path to <10% float compute is custom 1-bit kernels, not removing fp32 latents.
- `v76_math_derivation.md`: Q-shift / K-shift bilinear depth analysis (the math that motivated K-shift L0; also contains the refuted Q-shift collapse argument).
## Reproducing claims
Each training run's eval and probe metrics are in two places:
- `logs/v76_300M_*.out`: JSONL events with `{event, step, val_bpc, ...}` (see the parsing sketch below)
- `ckpt/*_pointer_health.json`: the full IPMR matrix at the saved `best.pt` step
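For example, pulling the eval curve out of the logs takes a few lines. The filenames follow the glob above; the `.out` files mix JSONL events with plain stdout, so non-JSON lines are skipped.

```python
import glob
import json

for path in sorted(glob.glob("logs/v76_300M_*.out")):
    with open(path) as f:
        for line in f:
            try:
                ev = json.loads(line)
            except json.JSONDecodeError:
                continue                      # plain stdout line, not an event
            if isinstance(ev, dict) and "val_bpc" in ev:
                print(path, ev["step"], ev["val_bpc"])
```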
To re-probe a checkpoint:
```bash
# On a CUDA host with the codebase installed
python code/analyze_pointer_health.py \
    --ckpt v76_300M_fresh_rwsp_kshift_L0_best.pt \
    --device cpu --n-batches 4 --batch-size 4 --seq-len 512
```
## License
MIT for code. Checkpoints are derivative of FineWeb-Edu (ODC-By 1.0) and the GLM-5.1-Reasoning-1M-Cleaned dataset; see those datasets' licenses for downstream restrictions on the GLM-finetuned checkpoint specifically.
## Active work (as of upload time)
A pivot to "entirely 1-bit training" is in progress, defined as <10% of wall-clock time spent on float ops. The initial truebit (no-fp32-latent) prototype showed the flip-counter rule loses too much gradient signal at synth scale. Profiling 300M v76 training shows GEMMs at ~44% of CUDA time, RWSP topk at 19%, and the ALiBi/causal mask at 13%. RTX 5090 int8 GEMM is slower than bf16 at our shapes (dispatch-bound), so the path to <10% requires custom XNOR/popcount kernels, not int8 substitution.
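For reference, the kernel primitive in question replaces a ±1 dot product with XNOR plus popcount. A NumPy toy follows; it is not a kernel and not this project's code, only an illustration of the arithmetic identity dot = 2·matches − n.

```python
import numpy as np

def pack_pm1(v: np.ndarray) -> np.ndarray:
    # Map +1 -> bit 1, -1 -> bit 0. Length must be a multiple of 8 here,
    # so packbits introduces no padding bits.
    return np.packbits(v > 0)

def xnor_popcount_dot(a: np.ndarray, b: np.ndarray) -> int:
    n = a.size
    same = np.bitwise_not(np.bitwise_xor(pack_pm1(a), pack_pm1(b)))  # XNOR
    matches = int(np.unpackbits(same).sum())                         # popcount
    return 2 * matches - n          # matches - mismatches over {-1,+1}

# Sanity check against the float dot product
rng = np.random.default_rng(0)
a = rng.choice([-1, 1], size=64)
b = rng.choice([-1, 1], size=64)
assert xnor_popcount_dot(a, b) == int(a @ b)
```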