# BitNet v75 — strict 1-bit attention LM
A from-scratch transformer LM where every weight matrix is strict ±1 (stored as 1 bit) and attention is Gumbel-hard pointer attention (each query attends to exactly one key). Trained on FineWeb-Edu.
## Models
| Model | Params | Val BPC | Status |
|---|---|---|---|
| v75 75M | 75M | 6.16 | Trained on 15B tokens, deployed |
| v75 300M | 315M | ~6.4 (still training) | In-flight on 15B-token run |
The 75M model is on HuggingFace at hidude562/bitnet-v75-75m.
## Architecture

Every layer is `BitLinearScaled` or `BitLinearScaledRaw`:
- Weights: ±1 (sign-STE on a latent float)
- Per-channel α (float scale)
- Threshold/bias (float offset, one per output channel)
- Output: ±1 via `sign_ste_clipped` (or float for residual-targeted layers)
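As a rough illustration of that recipe (class name, init values, and the exact STE clipping are assumptions here; the real `BitLinearScaled` / `BitLinearScaledRaw` live in model_v47b.py):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SignSTE(torch.autograd.Function):
    """Sign forward, clipped straight-through gradient backward."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).to(grad_out.dtype)  # clip STE outside [-1, 1]

class BitLinearSketch(nn.Module):
    """Illustrative stand-in: ±1 weights from a latent float, per-output-channel
    scale (alpha) and threshold/bias, optionally binarized output."""
    def __init__(self, d_in, d_out, binary_out=True):
        super().__init__()
        self.w_latent = nn.Parameter(torch.randn(d_out, d_in) * 0.02)
        self.alpha = nn.Parameter(torch.ones(d_out))       # per-channel float scale
        self.threshold = nn.Parameter(torch.zeros(d_out))  # per-channel float offset
        self.binary_out = binary_out

    def forward(self, x):
        w = SignSTE.apply(self.w_latent)                    # strict ±1 weights
        y = F.linear(x, w) * self.alpha + self.threshold
        return SignSTE.apply(y) if self.binary_out else y   # ±1 (or float) activations
```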
The residual stream is float (RMSNorm γ between blocks). Attention is Gumbel-hard one-hot pointer attention: each query selects exactly one key via `argmax((QK^T - alibi + Gumbel)/τ)`, with a straight-through estimator so gradients flow through the soft path in the backward pass.
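A minimal sketch of that selection rule (function name and shapes are assumptions; causal masking and score scaling are omitted; the production path is the FlexAttention/Triton code in flex_gumbel_attn.py):

```python
import torch
import torch.nn.functional as F

def gumbel_hard_pointer_attention(q, k, v, alibi_bias, tau=1.0):
    """Each query attends to exactly one key: hard argmax forward,
    straight-through softmax gradients backward. Assumed shapes:
    q, k, v: (B, H, T, D); alibi_bias: (H, T, T). Causal mask omitted."""
    scores = q @ k.transpose(-2, -1) - alibi_bias             # QK^T - alibi
    u = torch.rand_like(scores).clamp_(1e-9, 1 - 1e-9)
    gumbel = -torch.log(-torch.log(u))                        # Gumbel(0, 1) noise
    logits = (scores + gumbel) / tau

    soft = F.softmax(logits, dim=-1)                          # soft path (backward)
    hard = F.one_hot(logits.argmax(dim=-1), logits.size(-1)).to(soft.dtype)
    attn = hard + (soft - soft.detach())                      # STE: hard forward, soft grads
    return attn @ v
```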
See model_v47b.py, model_v47.py, model_v16.py for the model classes.
## Repo layout
| Dir/file | Purpose |
|---|---|
| `model_v47b.py`, `model_v47.py`, `model_v16.py`, `model.py` | Model + binary primitives |
| `flex_gumbel_attn.py` | FlexAttention-based Gumbel-hard attention (1.6× over eager) |
| `triton_gumbel_hard_attn_v2.py`, `*_op.py`, `*_triop.py` | Triton-kernel variants (slower than FlexAttention; kept for reference) |
| `train_300m_flex.py` | Production 300M training script (FlexAttention, bs=1 ga=32, single-GPU) |
| `train_fineweb_v75_300m.py`, `train_fineweb_v75_300m_fsdp2.py` | Multi-GPU 300M trainers (DeepSpeed ZeRO-2 / FSDP2) |
| `train_fineweb_v75_continue_lowlr.py` | 75M continuation script |
| `train_fineweb_v75_accelerate.py` | HuggingFace Accelerate-native trainer |
| `compact_v47b.py` | Compacts .pt → .bit2 (deployment format, ~180× smaller) |
| `infer_sample.py`, `chat_app.py`, `text_app.py` | Inference wrappers (call the C binary) |
| `infer_v47_v75.c`, `infer_v47.c` | C inference kernels (XNOR-popcount + AVX-512) |
| `bench_*.py` | Throughput / quality micro-benchmarks |
| `_test_*.py` | Correctness tests |
| `*.md` | Architecture + scaling + bottleneck reports |
## Key optimizations (vs naive eager)
| Optimization | Speedup | Caveat |
|---|---|---|
| FlexAttention with `score_mod` (75M) | 1.6× | Requires `torch._inductor.ir.FlexibleLayout.allow_indexing = True` |
| FlexAttention memory savings (300M) | +11.6% (bs=1→bs=4) | Frees ~10 GB peak vs eager |
| `DataLoaderConfiguration(non_blocking=True)` | +4.2% | Free in Accelerate path |
| Optimizer-state preservation in resume | (correctness) | Fixes plateau bug |
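For orientation, a `score_mod` that injects the ALiBi bias, Gumbel noise, and temperature might be shaped roughly like the sketch below. The `allow_indexing` flag is the one from the table; the names, shapes, and how the hard one-hot selection plus STE are layered on top are assumptions (see flex_gumbel_attn.py for the real kernel).

```python
import torch
import torch._inductor.ir
from torch.nn.attention.flex_attention import flex_attention

# Needed for score_mods that index into captured tensors (see table above).
torch._inductor.ir.FlexibleLayout.allow_indexing = True

def make_score_mod(alibi_slopes, gumbel, tau):
    """Hypothetical score_mod: score -> (score - alibi + gumbel) / tau.
    alibi_slopes: (H,); gumbel: pre-sampled noise of shape (B, H, Q, KV)."""
    def score_mod(score, b, h, q_idx, kv_idx):
        bias = alibi_slopes[h] * (q_idx - kv_idx)
        return (score - bias + gumbel[b, h, q_idx, kv_idx]) / tau
    return score_mod

# Usage sketch: out = flex_attention(q, k, v, score_mod=make_score_mod(slopes, noise, 1.0))
```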
See intern_brief_architecture_optimizations.md for the architecture-specific
optimization research questions.
## Build C inference

```bash
gcc -O3 -march=native -ffast-math -o infer_v47_v75 infer_v47_v75.c -lm
```

For an AVX-512 build, use infer_v47_avx512.c.
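The Python wrappers (infer_sample.py, chat_app.py, text_app.py) shell out to this binary. A hypothetical invocation, assuming the binary takes a .bit2 model path and a prompt (the actual CLI is whatever infer_v47_v75.c parses):

```python
import subprocess

# Hypothetical arguments; check infer_v47_v75.c / infer_sample.py for the real CLI.
result = subprocess.run(
    ["./infer_v47_v75", "model.bit2", "Once upon a time"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```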
## Train from scratch (single 5090)

```bash
python train_300m_flex.py
```

Defaults: 315M params, bs=1, ga=32, 234,375 steps = 15B tokens. Throughput: ~24K tok/s at steady state. ETA: ~7 days.
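A quick sanity check of those numbers (the 2048-token context length is an assumption, not stated in this README):

```python
# Back-of-the-envelope check of the defaults above.
steps, bs, ga, seq_len = 234_375, 1, 32, 2048   # seq_len assumed
tokens = steps * bs * ga * seq_len              # ≈ 15.4B tokens
days = tokens / 24_000 / 86_400                 # ~24K tok/s steady-state
print(f"{tokens / 1e9:.1f}B tokens, ~{days:.1f} days")
```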
## License
Apache 2.0