
BitNet v75 — strict 1-bit attention LM

A from-scratch transformer LM where every weight matrix is strict ±1 (stored as 1 bit) and attention is Gumbel-hard pointer attention (each query attends to exactly one key). Trained on FineWeb-Edu.

Models

| Model | Params | Val BPC | Status |
|---|---|---|---|
| v75 75M | 75M | 6.16 | Trained on 15B tokens, deployed |
| v75 300M | 315M | ~6.4 (still training) | In-flight on 15B-token run |

The 75M model is on the Hugging Face Hub at hidude562/bitnet-v75-75m.
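
Val BPC is bits per character. Assuming it is derived from the mean validation cross-entropy in nats (an assumption; the card does not state the exact recipe), the conversion is a division by ln 2:

```python
import math

def bits_per_char(ce_loss_nats: float) -> float:
    """Convert mean cross-entropy (nats per character) to bits per character."""
    return ce_loss_nats / math.log(2)

# A mean val loss of ~4.27 nats/char corresponds to ~6.16 BPC.
print(round(bits_per_char(4.27), 2))
```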

Architecture

Every layer is BitLinearScaled or BitLinearScaledRaw:

  • Weights: ±1 (sign-STE on a latent float)
  • Per-channel α (float scale)
  • Threshold/bias (float offset, one per output channel)
  • Output: ±1 via sign_ste_clipped (or float for residual-targeted layers)
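
The binary-linear building block above can be sketched in PyTorch. This is a minimal illustration with assumed shapes, not the repo's exact class (see model_v47b.py and friends for the real implementations):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinearScaled(nn.Module):
    """Sketch of a strict ±1 linear layer with per-channel float scale/offset."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Latent float weights; only their sign is used in the forward pass.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.alpha = nn.Parameter(torch.ones(out_features))   # per-channel float scale
        self.bias = nn.Parameter(torch.zeros(out_features))   # per-channel float offset

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        w_bin = torch.where(w >= 0, 1.0, -1.0)   # strict ±1 weights
        w_ste = w + (w_bin - w).detach()         # sign-STE: gradient flows to latent w
        return F.linear(x, w_ste) * self.alpha + self.bias
```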

The residual stream is float (RMSNorm γ between blocks). Attention is Gumbel-hard one-hot pointer attention: each query selects exactly one key via argmax((QK^T - alibi + Gumbel)/τ), with a straight-through estimator routing gradients through the soft softmax in the backward pass.
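
The pointer-attention mechanism can be sketched as follows (causal masking and the ALiBi bias are omitted for brevity; this is an illustration, not the repo's fused kernel):

```python
import torch
import torch.nn.functional as F

def gumbel_hard_attention(q, k, v, tau=1.0):
    """One-hot pointer attention: each query attends to exactly one key.

    F.gumbel_softmax with hard=True applies the same straight-through trick
    described above: hard one-hot in the forward pass, soft softmax gradients
    in the backward pass.
    """
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    attn = F.gumbel_softmax(scores, tau=tau, hard=True, dim=-1)  # one-hot rows
    return attn @ v
```

Because each attention row is exactly one-hot, every output row is an exact copy of a single value row.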

See model_v47b.py, model_v47.py, model_v16.py for the model classes.

Repo layout

| Dir/file | Purpose |
|---|---|
| model_v47b.py, model_v47.py, model_v16.py, model.py | Model + binary primitives |
| flex_gumbel_attn.py | FlexAttention-based Gumbel-hard attention (1.6× over eager) |
| triton_gumbel_hard_attn_v2.py, *_op.py, *_triop.py | Triton-kernel variants (slower than FlexAttention; kept for reference) |
| train_300m_flex.py | Production 300M training script (FlexAttention, bs=1, ga=32, single GPU) |
| train_fineweb_v75_300m.py, train_fineweb_v75_300m_fsdp2.py | Multi-GPU 300M trainers (DeepSpeed ZeRO-2 / FSDP2) |
| train_fineweb_v75_continue_lowlr.py | 75M continuation script |
| train_fineweb_v75_accelerate.py | HuggingFace Accelerate-native trainer |
| compact_v47b.py | Compact .pt.bit2 deployment format (~180× smaller) |
| infer_sample.py, chat_app.py, text_app.py | Inference wrappers (call the C binary) |
| infer_v47_v75.c, infer_v47.c | C inference kernels (XNOR-popcount + AVX-512) |
| bench_*.py | Throughput / quality micro-benchmarks |
| _test_*.py | Correctness tests |
| *.md | Architecture + scaling + bottleneck reports |
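
The XNOR-popcount trick used by the C inference kernels can be illustrated in pure Python. The helper names here are hypothetical; the real kernels operate on packed weight rows in infer_v47_v75.c:

```python
import numpy as np

def pack_pm1(w):
    """Pack a ±1 vector into a bitmask: +1 -> 1 bit, -1 -> 0 bit (1 bit/weight)."""
    bits = (np.asarray(w) > 0).astype(np.uint8)
    return int.from_bytes(np.packbits(bits).tobytes(), "big"), len(w)

def xnor_dot(a_packed, b_packed, n):
    """Dot product of two ±1 vectors: XNOR counts sign matches, dot = 2*matches - n."""
    total_bits = (n + 7) // 8 * 8          # packbits pads to whole bytes
    mask = (1 << total_bits) - 1
    matches = bin(~(a_packed ^ b_packed) & mask).count("1")
    pad = total_bits - n                   # zero-padding bits always "match"; drop them
    return 2 * (matches - pad) - n
```

For example, a = [1, -1, 1, 1, -1] and b = [1, 1, 1, -1, -1] agree in three positions and differ in two, so the dot product is 3 - 2 = 1.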

Key optimizations (vs naive eager)

| Optimization | Speedup | Caveat |
|---|---|---|
| FlexAttention with score_mod (75M) | 1.6× | Requires torch._inductor.ir.FlexibleLayout.allow_indexing = True |
| FlexAttention memory savings (300M) | +11.6% (bs=1→bs=4) | Frees ~10 GB peak vs eager |
| DataLoaderConfiguration(non_blocking=True) | +4.2% | Free in the Accelerate path |
| Optimizer-state preservation on resume | (correctness) | Fixes plateau bug |

See intern_brief_architecture_optimizations.md for the architecture-specific optimization research questions.

Build C inference

gcc -O3 -march=native -ffast-math -o infer_v47_v75 infer_v47_v75.c -lm

For an AVX-512 build, use infer_v47_avx512.c.

Train from scratch (single 5090)

python train_300m_flex.py

Defaults: 315M params, bs=1 ga=32, 234,375 steps = 15B tokens. Throughput: ~24K tok/s steady-state. ETA: ~7 days.
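
The step count and token budget are consistent if each sequence is 2048 tokens (an assumption; the training script defines the actual context length):

```python
steps = 234_375
batch_size = 1
grad_accum = 32          # 32 sequences per optimizer step
seq_len = 2048           # assumed context length, not stated in this card

tokens = steps * batch_size * grad_accum * seq_len
print(f"{tokens / 1e9:.2f}B tokens")      # ≈ 15.36B

days = tokens / 24_000 / 86_400           # at ~24K tok/s steady-state
print(f"~{days:.1f} days")                # ≈ the quoted ~7-day ETA
```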

License

Apache 2.0
