# BitNet v75 — strict 1-bit attention LM
A from-scratch transformer LM where every weight matrix is strict ±1 (stored as 1 bit) and attention is Gumbel-hard pointer attention (each query attends to exactly one key). Trained on FineWeb-Edu.
## Models
| Model | Params | Val BPC | Status |
|---|---|---|---|
| v75 75M | 75M | 6.16 | Trained on 15B tokens, deployed |
| v75 300M | 315M | ~6.4 (still training) | In-flight on 15B-token run |
The 75M model is on HuggingFace at hidude562/bitnet-v75-75m.
## Architecture

Every layer is `BitLinearScaled` or `BitLinearScaledRaw`:
- Weights: ±1 (sign-STE on a latent float)
- Per-channel α (float scale)
- Threshold/bias (float offset, one per output channel)
- Output: ±1 via `sign_ste_clipped` (or float for residual-targeted layers)
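As a rough illustration of that recipe (class name, init values, and the exact STE clipping are assumptions here; the real `BitLinearScaled` / `BitLinearScaledRaw` live in model_v47b.py):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SignSTE(torch.autograd.Function):
    """Sign forward, clipped straight-through gradient backward."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).to(grad_out.dtype)  # clip STE outside [-1, 1]

class BitLinearSketch(nn.Module):
    """Illustrative stand-in: ±1 weights from a latent float, per-output-channel
    scale (alpha) and threshold/bias, optionally binarized output."""
    def __init__(self, d_in, d_out, binary_out=True):
        super().__init__()
        self.w_latent = nn.Parameter(torch.randn(d_out, d_in) * 0.02)
        self.alpha = nn.Parameter(torch.ones(d_out))       # per-channel float scale
        self.threshold = nn.Parameter(torch.zeros(d_out))  # per-channel float offset
        self.binary_out = binary_out

    def forward(self, x):
        w = SignSTE.apply(self.w_latent)                    # strict ±1 weights
        y = F.linear(x, w) * self.alpha + self.threshold
        return SignSTE.apply(y) if self.binary_out else y   # ±1 (or float) activations
```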
The residual stream is float (RMSNorm γ between blocks). Attention is Gumbel-hard one-hot pointer attention: each query selects exactly one key via `argmax((QK^T - alibi + Gumbel)/τ)`, with a straight-through estimator so gradients flow through the soft path in the backward pass.
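A minimal sketch of that selection rule (function name and shapes are assumptions; causal masking and score scaling are omitted; the production path is the FlexAttention/Triton code in flex_gumbel_attn.py):

```python
import torch
import torch.nn.functional as F

def gumbel_hard_pointer_attention(q, k, v, alibi_bias, tau=1.0):
    """Each query attends to exactly one key: hard argmax forward,
    straight-through softmax gradients backward. Assumed shapes:
    q, k, v: (B, H, T, D); alibi_bias: (H, T, T). Causal mask omitted."""
    scores = q @ k.transpose(-2, -1) - alibi_bias             # QK^T - alibi
    u = torch.rand_like(scores).clamp_(1e-9, 1 - 1e-9)
    gumbel = -torch.log(-torch.log(u))                        # Gumbel(0, 1) noise
    logits = (scores + gumbel) / tau

    soft = F.softmax(logits, dim=-1)                          # soft path (backward)
    hard = F.one_hot(logits.argmax(dim=-1), logits.size(-1)).to(soft.dtype)
    attn = hard + (soft - soft.detach())                      # STE: hard forward, soft grads
    return attn @ v
```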
See model_v47b.py, model_v47.py, model_v16.py for the model classes.
## Repo layout
| Dir/file | Purpose |
|---|---|
| `model_v47b.py`, `model_v47.py`, `model_v16.py`, `model.py` | Model + binary primitives |
| `flex_gumbel_attn.py` | FlexAttention-based Gumbel-hard attention (1.6× over eager) |
| `triton_gumbel_hard_attn_v2.py`, `*_op.py`, `*_triop.py` | Triton-kernel variants (slower than FlexAttention; kept for reference) |
| `train_300m_flex.py` | Production 300M training script (FlexAttention, bs=1 ga=32, single-GPU) |
| `train_fineweb_v75_300m.py`, `train_fineweb_v75_300m_fsdp2.py` | Multi-GPU 300M trainers (DeepSpeed ZeRO-2 / FSDP2) |
| `train_fineweb_v75_continue_lowlr.py` | 75M continuation script |
| `train_fineweb_v75_accelerate.py` | HuggingFace Accelerate-native trainer |
| `compact_v47b.py` | Compacts .pt → .bit2 (deployment format, ~180× smaller) |
| `infer_sample.py`, `chat_app.py`, `text_app.py` | Inference wrappers (call the C binary) |
| `infer_v47_v75.c`, `infer_v47.c` | C inference kernels (XNOR-popcount + AVX-512) |
| `bench_*.py` | Throughput / quality micro-benchmarks |
| `_test_*.py` | Correctness tests |
| `*.md` | Architecture + scaling + bottleneck reports |
## Key optimizations (vs naive eager)
| Optimization | Speedup | Caveat |
|---|---|---|
| FlexAttention with `score_mod` (75M) | 1.6× | Requires `torch._inductor.ir.FlexibleLayout.allow_indexing = True` |
| FlexAttention memory savings (300M) | +11.6% (bs=1→bs=4) | Frees ~10 GB peak vs eager |
| `DataLoaderConfiguration(non_blocking=True)` | +4.2% | Free in Accelerate path |
| Optimizer-state preservation in resume | (correctness) | Fixes plateau bug |
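For orientation, a `score_mod` that injects the ALiBi bias, Gumbel noise, and temperature might be shaped roughly like the sketch below. The `allow_indexing` flag is the one from the table; the names, shapes, and how the hard one-hot selection plus STE are layered on top are assumptions (see flex_gumbel_attn.py for the real kernel).

```python
import torch
import torch._inductor.ir
from torch.nn.attention.flex_attention import flex_attention

# Needed for score_mods that index into captured tensors (see table above).
torch._inductor.ir.FlexibleLayout.allow_indexing = True

def make_score_mod(alibi_slopes, gumbel, tau):
    """Hypothetical score_mod: score -> (score - alibi + gumbel) / tau.
    alibi_slopes: (H,); gumbel: pre-sampled noise of shape (B, H, Q, KV)."""
    def score_mod(score, b, h, q_idx, kv_idx):
        bias = alibi_slopes[h] * (q_idx - kv_idx)
        return (score - bias + gumbel[b, h, q_idx, kv_idx]) / tau
    return score_mod

# Usage sketch: out = flex_attention(q, k, v, score_mod=make_score_mod(slopes, noise, 1.0))
```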
See intern_brief_architecture_optimizations.md for the architecture-specific
optimization research questions.
## Build C inference

```bash
gcc -O3 -march=native -ffast-math -o infer_v47_v75 infer_v47_v75.c -lm
```

For an AVX-512 build, use infer_v47_avx512.c.
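The Python wrappers (infer_sample.py, chat_app.py, text_app.py) shell out to this binary. A hypothetical invocation, assuming the binary takes a .bit2 model path and a prompt (the actual CLI is whatever infer_v47_v75.c parses):

```python
import subprocess

# Hypothetical arguments; check infer_v47_v75.c / infer_sample.py for the real CLI.
result = subprocess.run(
    ["./infer_v47_v75", "model.bit2", "Once upon a time"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```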
## Train from scratch (single 5090)

```bash
python train_300m_flex.py
```

Defaults: 315M params, bs=1, ga=32, 234,375 steps = 15B tokens. Throughput: ~24K tok/s at steady state. ETA: ~7 days.
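A quick sanity check of those numbers (the 2048-token context length is an assumption, not stated in this README):

```python
# Back-of-the-envelope check of the defaults above.
steps, bs, ga, seq_len = 234_375, 1, 32, 2048   # seq_len assumed
tokens = steps * bs * ga * seq_len              # ≈ 15.4B tokens
days = tokens / 24_000 / 86_400                 # ~24K tok/s steady-state
print(f"{tokens / 1e9:.1f}B tokens, ~{days:.1f} days")
```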
## License
Apache 2.0