bitnet1 — 300M v76 1-bit LLM research backup

Backup of the active research VM for the 1-bit LLM ("bitnet1") project. This is a research dump, not a polished release: it preserves the in-flight state of training scripts, checkpoints, profile logs, and analysis notes.

What's here

| Path | Contents |
| --- | --- |
| code/ | The training and inference stack from the VM's /root/bitnet1/code/ (model_v47.py, model_v47b.py, model_v16.py, train_300m_flex.py, analyze_pointer_health.py, etc.) |
| local-source/ | The author's local /home/nathan/1bitllm/ repo: synth experiments, benchmarks, ablation scripts, analysis notebooks, and writeups |
| *.pt, *.bit2, *_pointer_health.json, *_induction.json (root) | Best-of-run checkpoints plus per-checkpoint probe-metric sidecars — see § Checkpoints below |
| logs/ | Training run logs from this session (and recent prior ones): JSONL events per run, plus stdout |
| notes/ | Session writeups (math derivations, bonsai/ALiBi findings, truebit negative result) |

Checkpoints (at repo root)

| File | Size | Notes |
| --- | --- | --- |
| v76_300M_fresh_rwsp_kshift_L0_best.pt | 1.26 GB | Deployed cold-start v76: val_bpc 5.674 at step 39K, top L10H0 IPMR 10.21%, 15 specialist heads ≥5%. The headline 300M result. |
| v76_300M_bonsai_extend_best.pt | 1.26 GB | Bonsai-trained v76: val_bpc 6.160 at step 15K (= 2K smoke + 13K extend), top L0H0 IPMR 6.62%, 5 specialist heads ≥5%. Canonical Olsson L0 induction emerges with the bonsai recipe. |
| v76_300M_glm_ft_best.pt | 1.26 GB | GLM-5.1-Reasoning finetune of cold-start v76 (1500 steps, unfrozen): val_bpc 6.279 on GLM, FineWeb-probe top IPMR 3.74% (catastrophic forgetting demonstrated). |
| fineweb_edu_v75_300M_flex_best.pt | 1.26 GB | 300M v75 baseline (no RWSP, no K-shift): step 67K, val_bpc 6.015. Reference for v76 ablations. |
| fineweb_edu_v75_75M_resumed_best.pt | 6.4 GB | 75M v75 baseline (with optimizer state): step 228K, val_bpc 6.16. Long-trained reference. |
| v57*_300M.bit2, v75_resumed.bit2 | 45 MB each | Older bit-packed inference exports (deployed in chat_app/text_app). |

JSON sidecars (*_pointer_health.json) preserve the exact IPMR / HDR / M4 metrics and per-(layer, head) IPMR matrix so all paper claims are verifiable from these files alone.
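As a sketch of how such a sidecar could be consumed, assuming a nested-list ipmr_matrix field holding per-(layer, head) IPMR percentages (the actual key names in the *_pointer_health.json files are not confirmed here):

```python
# Hypothetical sidecar schema (key names are assumptions): the
# per-(layer, head) IPMR matrix stored as nested lists, in percent.
sidecar = {
    "ipmr_matrix": [
        [6.62, 1.10, 0.40],  # layer 0: heads 0..2
        [0.30, 5.40, 0.20],  # layer 1: heads 0..2
    ],
}

def top_ipmr(matrix):
    """Return (layer, head, value) of the highest-IPMR head."""
    return max(
        ((l, h, v) for l, row in enumerate(matrix) for h, v in enumerate(row)),
        key=lambda t: t[2],
    )

def specialist_heads(matrix, threshold=5.0):
    """Count heads at or above the specialist threshold (percent IPMR)."""
    return sum(v >= threshold for row in matrix for v in row)

layer, head, value = top_ipmr(sidecar["ipmr_matrix"])
print(f"top L{layer}H{head} IPMR {value:.2f}%, "
      f"{specialist_heads(sidecar['ipmr_matrix'])} heads >= 5%")
```

The same two reductions (arg-max over the matrix, threshold count) reproduce the "top LxHy IPMR" and "N specialist heads ≥5%" figures quoted in the checkpoint table.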

Recipe summary

The deployed cold-start v76 is the v75 1-bit BitNet stack (model_v47b.BitLMv47B, sign-STE quantization with per-channel α scales, RMSNorm) plus two training-time interventions:

  1. RWSP-G attention — rank-weighted soft pointer with a geometric anneal over the top-K survivors. Provides the gradient-flow breadth that hard Gumbel-argmax lacks (use_rwsp=True, rwsp_k=4).
  2. K-shift L0 — at attention layer 0, K[t] = k_proj(x[t] + x[t-1]) instead of k_proj(x[t]). Zero new parameters; the Olsson 2022 prev-token signal is injected as a data-routing change (use_kshift=True, kshift_layers=[0]).
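A minimal numpy sketch of both interventions, reconstructed from the one-line descriptions above (rwsp_weights is an illustrative guess at the RWSP-G weighting, not the code in model_v47b.py):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 8                       # toy sequence length and model dim
x = rng.standard_normal((T, d))   # layer-0 input activations
W_k = rng.standard_normal((d, d)) # key projection

# Standard keys: K[t] = k_proj(x[t]).
K_std = x @ W_k

# K-shift L0: K[t] = k_proj(x[t] + x[t-1]), with x[-1] taken as zero.
# Zero new parameters: the same W_k is reused, only the input routing changes.
x_prev = np.vstack([np.zeros((1, d)), x[:-1]])
K_shift = (x + x_prev) @ W_k

# Illustrative guess at RWSP-G pointer weights: keep the top-k attention
# scores and weight survivors geometrically by rank; annealing gamma -> 0
# recovers a hard argmax pointer.
def rwsp_weights(scores, k=4, gamma=0.5):
    order = np.argsort(scores)[::-1][:k]  # top-k survivor indices, best first
    w = gamma ** np.arange(k, dtype=float)
    w /= w.sum()                          # normalized rank weights
    out = np.zeros_like(scores)
    out[order] = w
    return out
```

Note that the first K-shift key equals the standard one (there is no previous token at t = 0), so the change only affects positions t ≥ 1.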

Bonsai variant additionally anneals the weight quantization mix from 0 (fp32) to 1 (sign-STE) over the first 60% of training (--anneal-binarize --binarize-anneal-end-frac 0.6). See notes/session_2026-05-07_bonsai_alibi_finetune.md for the full result.
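A sketch of what that anneal could look like; the exact mixing rule inside train_300m_flex.py is an assumption here:

```python
import numpy as np

def anneal_mix(step, total_steps, end_frac=0.6):
    """Quantization mix lambda in [0, 1]: ramps linearly from 0 (pure fp32)
    to 1 (pure sign-STE) over the first end_frac of training."""
    return min(1.0, step / (end_frac * total_steps))

def mixed_weight(w, lam):
    """Blend latent fp32 weights with their sign-quantized version,
    using a per-channel alpha scale as in the sign-STE quantizer."""
    alpha = np.abs(w).mean(axis=-1, keepdims=True)  # per-output-channel scale
    return (1.0 - lam) * w + lam * alpha * np.sign(w)
```

At step 0 the forward pass sees unmodified fp32 weights; past 60% of training it sees only the quantized weights (with STE gradients still flowing to the latent fp32 copy).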

Key findings (in notes/)

- session_2026-05-07_bonsai_alibi_finetune.md — three landmark findings in one writeup: (1) ALiBi was blocking all prior synth induction probes (--no-alibi recovers them); (2) bonsai works for 1-bit once ALiBi is removed (reverses the prior memo); (3) finetuning destroys 1-bit induction circuits regardless of freeze pattern (more freezing → worse).
- truebit_no_latent_negative.md — negative result on training without fp32 latent weights. The flip-counter rule loses too much gradient signal; the right path to <10% float compute is custom 1-bit kernels, not removing the fp32 latents.
- v76_math_derivation.md — Q-shift / K-shift bilinear depth analysis (the math that motivated K-shift L0; also contains the refuted Q-shift collapse argument).

Reproducing claims

Each training run's eval and probe metrics are in two places:

- logs/v76_300M_*.out — JSONL events with {event, step, val_bpc, ...}
- ckpt/*_pointer_health.json — the full IPMR matrix at the saved best.pt step
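A short sketch of scanning such a log for the best eval, assuming eval events carry the {event, step, val_bpc} fields quoted above (the event names themselves are assumptions):

```python
import io
import json

# Stand-in for open("logs/v76_300M_....out"): three hypothetical JSONL events.
log = io.StringIO(
    '{"event": "eval", "step": 1000, "val_bpc": 6.91}\n'
    '{"event": "eval", "step": 2000, "val_bpc": 6.40}\n'
    '{"event": "ckpt", "step": 2000}\n'
)

# Keep only events that report val_bpc, then take the minimum.
best = min(
    (json.loads(line) for line in log if '"val_bpc"' in line),
    key=lambda e: e["val_bpc"],
)
print(best["step"], best["val_bpc"])
```

The step of the minimum-val_bpc event should match the step recorded in the corresponding *_best.pt sidecar.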

To re-probe a checkpoint:

```bash
# With the codebase's dependencies installed; the probe runs on CPU as
# configured here, so no GPU is required for this check
python code/analyze_pointer_health.py \
    --ckpt v76_300M_fresh_rwsp_kshift_L0_best.pt \
    --device cpu --n-batches 4 --batch-size 4 --seq-len 512
```

License

MIT for code. Checkpoints are derivative of FineWeb-Edu (ODC-By 1.0) and, for the GLM-finetuned checkpoint specifically, the GLM-5.1-Reasoning-1M-Cleaned dataset; see those datasets' licenses for downstream restrictions.

Active work (as of upload time)

Pivot in progress to "entirely 1-bit training", defined as <10% of wall-clock spent on float ops. The initial truebit (no-fp32-latent) prototype showed the flip-counter rule loses too much gradient signal at synth scale. Profiling 300M v76 training: GEMMs are ~44% of CUDA time, RWSP topk 19%, ALiBi/causal mask 13%. RTX 5090 int8 GEMM is slower than bf16 at our shapes (dispatch-bound), so the path to <10% requires custom XNOR/popcount kernels, not int8 substitution.
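For context, the XNOR/popcount identity those kernels would exploit, shown as a pure-Python sketch: for sign vectors a, b in {-1, +1}^n packed as bit masks, the dot product reduces to one XOR and one popcount per machine word instead of n multiply-adds.

```python
def pack_signs(signs):
    """Pack a {-1, +1} vector into an int bit mask (bit i set <=> +1)."""
    bits = 0
    for i, s in enumerate(signs):
        if s > 0:
            bits |= 1 << i
    return bits

def bit_dot(a_bits, b_bits, n):
    """Dot product of two packed sign vectors of length n:
    n - 2 * popcount(a XOR b), since each differing bit contributes -1
    and each matching bit contributes +1."""
    return n - 2 * bin(a_bits ^ b_bits).count("1")

a = [+1, -1, +1, +1]
b = [+1, +1, -1, +1]
assert bit_dot(pack_signs(a), pack_signs(b), len(a)) == sum(
    x * y for x, y in zip(a, b)
)
```

(XNOR with popcount-of-matches and XOR with popcount-of-mismatches are equivalent formulations; XOR avoids masking the unused high bits.)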
