Training low-bit ternary models with Axolotl

Community Article Published April 30, 2026

Authors: Axolotl team, Younes Belkada of FalconLLM team

Models: https://huggingface.co/collections/axolotl-ai-co/falcon-e-bitnet

Recent work such as Bonsai-1bit has shown strong interest in producing low-bit LLMs for the community, enabling capable models that can run on edge devices and under constrained resources. Throughout this collaboration, we focused on making 1.58-bit (ternary) LLM training more accessible to the community by integrating the training of TII's Falcon BitNet series of models into axolotl. We also release experimental models, trained with a pure SFT stage in axolotl starting from the prequantized bfloat16 versions of existing ternary-format LLMs, as well as DPO fine-tuned variants, to demonstrate the feasibility of BitNet fine-tuning.

What is BitNet (ternary format LLMs)?

BitNet was introduced by Microsoft in the 2024 paper The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits. The idea is to train models to become resilient to ternary quantization (i.e., weights being either -1, 0, or 1). This is achieved by injecting quantization errors into the model weights and activations during training, applied to all linear layers (except the lm_head, which is sensitive to quantization).

The trained models have ternary weights with only a single scaling factor per tensor, achieving up to 7x memory reduction (depending on the vocabulary size of the model) compared to their bfloat16 counterparts.
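To see where the "up to 7x" figure comes from, here is a rough back-of-the-envelope estimate (the parameter split below is illustrative, not taken from any specific checkpoint): linear weights go from 16 bits to roughly 2 bits, while embeddings and the lm_head stay in bfloat16, so the larger the vocabulary's share of the parameters, the lower the effective reduction.

# Rough memory estimate for a hypothetical ~7B model (illustrative numbers only).
linear_params = 6.8e9   # parameters in quantizable linear layers (assumed)
embed_params = 0.2e9    # embedding + lm_head parameters kept in bf16 (assumed)

bf16_bytes = (linear_params + embed_params) * 2            # 16 bits per parameter
ternary_bytes = linear_params * 0.25 + embed_params * 2    # ~2 bits vs 16 bits

print(f"memory reduction: {bf16_bytes / ternary_bytes:.1f}x")  # -> ~6.7x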

Training happens in bfloat16, as activations and gradients are always computed in that precision. In simple words, ternary quantization is 'simulated' during training so that the model is compatible with ternary quantization at inference time.

Below is the native PyTorch code for BitNet fake quantization applied to linear layers. In practice, optimized Triton kernels are used to perform these operations.

def weight_quant_torch(w):
    # Per-tensor absmean scale; the clamp avoids division by zero.
    scale = 1.0 / w.abs().mean().clamp_(min=1e-05)
    # Round weights to {-1, 0, 1}, then rescale back to the original range.
    u = (w * scale).round().clamp_(-1, 1) / scale
    return u

def activation_quant_torch(x):
    # Per-token absmax scale mapping activations to the int8 range.
    scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp_(min=1e-05)
    # Round to int8 levels, then rescale back (fake quantization).
    y = (x * scale).round().clamp_(-128, 127) / scale
    return y

Note: injecting these formulas in the forward pass introduces non-differentiable operations (torch.round, torch.clamp, ...); this is bypassed by using a straight-through estimator.
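To illustrate how the straight-through estimator is typically wired in, here is a minimal sketch of a BitLinear-style layer reusing the two functions above. This follows the pattern of the BitNet reference code and is not the optimized kernel path used by onebitllms.

import torch.nn as nn
import torch.nn.functional as F

class BitLinearSketch(nn.Linear):
    def forward(self, x):
        w = self.weight
        # The forward pass sees the quantized values; .detach() makes the
        # backward pass treat quantization as identity, so gradients flow
        # straight through to the underlying bfloat16 weights.
        x_q = x + (activation_quant_torch(x) - x).detach()
        w_q = w + (weight_quant_torch(w) - w).detach()
        return F.linear(x_q, w_q, self.bias)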

The name 1.58-bit comes from the fact that a ternary weight carries log2(3) ≈ 1.58 bits of information, so it is theoretically possible to pack the weights using an average of 1.58 bits per parameter if certain conditions are met. In practice, ternary weights are packed in 2-bit precision using uint8 tensors (i.e., 4 parameters per uint8 value).
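For intuition, a hypothetical packing routine could look like the sketch below; the actual layouts used by onebitllms and llama.cpp differ in their details.

import torch

def pack_ternary_sketch(w_ternary):
    # Map {-1, 0, 1} -> {0, 1, 2} so each value fits in 2 bits.
    vals = (w_ternary + 1).to(torch.uint8).flatten()
    assert vals.numel() % 4 == 0, "pad to a multiple of 4 in practice"
    vals = vals.view(-1, 4)
    # Pack 4 two-bit values into each uint8.
    return vals[:, 0] | (vals[:, 1] << 2) | (vals[:, 2] << 4) | (vals[:, 3] << 6)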

This blogpost focuses on fine-tuning the Falcon-E series of BitNet models; more details about this model series can be found in the attached link.

BitNet support in the LLM ecosystem

BitNet models have relatively strong support in the ecosystem for CPU inference.

llama.cpp and ik-llama.cpp support the TQ2_0 and TQ1_0 quantization formats (ternary weights packed in 2-bit and 1.58-bit precision, respectively), Apple's MLX also supports BitNet models through optimized Metal kernels, and it is also possible to run inference with the Hugging Face transformers library using torch.compile.
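As a minimal sketch of the transformers path, assuming a recent transformers release with BitNet support and a ternary checkpoint that loads directly (the model id below is a placeholder, not a specific release):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<path-or-id-of-a-ternary-checkpoint>"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model = torch.compile(model)  # as mentioned above, torch.compile is used to speed up inference

inputs = tokenizer("Ternary LLMs are", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))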

Although Microsoft recently released optimized GPU kernels for BitNet models, optimized GPU inference is not yet supported in popular serving frameworks such as vLLM or SGLang.

How to train BitNet models with axolotl?

The safest way to train a BitNet model is either to pretrain a model from scratch using the BitNet architecture, or to take an existing BitNet model in its prequantized form (i.e., the pure bfloat16 state dict of the checkpoint) and perform continual pre-training or fine-tuning of that checkpoint.

Sharing prequantized BitNet weights has become common practice through recent releases such as Falcon-E or Microsoft's recent BitNet-2B release.

After training, the checkpoints can be safely converted to ternary format using a simple conversion heuristic that packs the ternary weights into uint8 tensors. This can easily be achieved with the quantize_to_1bit utility function from TII's onebitllms library:

# uv pip install onebitllms
from onebitllms import quantize_to_1bit

quantize_to_1bit(INPUT_PATH, OUTPUT_PATH)

Training Falcon-E models with axolotl

First, make sure to install the onebitllms package and use one of the checkpoints below for fine-tuning the Falcon-E models:

  • tiiuae/Falcon-E-1B-prequantized-bf16
  • tiiuae/Falcon-E-3B-prequantized-bf16
  • tiiuae/Falcon3-10B-1.58bit-prequantized-bf16 (experimental)

The checkpoint tiiuae/Falcon3-10B-1.58bit-prequantized-bf16 is experimental: the original prequantized checkpoint was not released, so we approximated the prequantized weights by injecting the scales into the ternary weights and saving the model in bfloat16.
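Concretely, "injecting the scales" amounts to multiplying the ternary values back by the inverse of the per-tensor scale from the fake-quantization step. A rough sketch, with w_ternary and scale assumed to be given (variable names are illustrative):

import torch

# w_ternary: unpacked ternary weights in {-1, 0, 1}
# scale: per-tensor scale as in weight_quant_torch above (scale = 1 / mean(|w|))
w_bf16_approx = (w_ternary.float() / scale).to(torch.bfloat16)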

Inside your axolotl config, enable Bitnet fine-tuning with the following flag:

...
use_onebitllms: true
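For reference, a minimal config could look like the sketch below; the base model comes from the list above, while the dataset, hyper-parameters, and output path are placeholders. use_onebitllms is the only BitNet-specific flag, and the repository example additionally configures FSDP.

base_model: tiiuae/Falcon-E-1B-prequantized-bf16
use_onebitllms: true

datasets:
  - path: tatsu-lab/alpaca   # placeholder dataset
    type: alpaca

sequence_len: 2048
micro_batch_size: 1
gradient_accumulation_steps: 8
num_epochs: 1
learning_rate: 2e-5
bf16: true
output_dir: ./outputs/falcon-e-1b-bitnet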

And that's it! You can experiment with your own datasets and hyper-parameters to build your own ternary LLM, using the axolotl training CLI. We also expose an example config in examples/bitnet/falcon-e-1b.yaml which will use onebitllms and FSDP under the hood.

axolotl train examples/bitnet/falcon-e-1b.yaml

Once training is finished, make sure to convert the checkpoint to ternary format using the quantize_to_1bit utility function from onebitllms:

from onebitllms import quantize_to_1bit

quantize_to_1bit(INPUT_PATH, OUTPUT_PATH)

The saved checkpoint is compatible out of the box with the Hugging Face transformers and Apple MLX libraries. To generate GGUF files for llama.cpp, convert the HF checkpoint with the convert_hf_to_gguf.py script from that repository in bfloat16 format (llama.cpp automatically takes care of injecting the scales and converting the ternary weights to bfloat16), then use the llama-quantize tool to quantize the bfloat16 checkpoint to TQ2_0 or TQ1_0.
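A sketch of that conversion flow, with placeholder paths (the exact flags may differ depending on your llama.cpp version):

python convert_hf_to_gguf.py /path/to/ternary_checkpoint --outtype bf16 --outfile falcon-e-bf16.gguf
./llama-quantize falcon-e-bf16.gguf falcon-e-tq2_0.gguf TQ2_0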

Note: for BitNet fine-tuning, currently only full fine-tuning is supported; enabling LoRA for BitNet models remains an unexplored area of research.

Training any arbitrary model with this flag would technically be possible, at your own risk!

Falcon-E-3B-1.2-Exp / Falcon-E-10B-1.2-Exp — how to use them

We also release the prequantized bfloat16 checkpoints of our fine-tuned models (SFT and DPO) so that the community can explore fine-tuning them further. Simply enable use_onebitllms in your axolotl config and go ahead.

Future work

In the future, an interesting area of exploration would be applying on-policy RL methods to BitNet models. For off-policy training (e.g., DPO), there is not much to worry about; for on-policy RL, however, the rollout stage requires a trick: approximating the BitNet model with its bfloat16 version, i.e. the ternary model with the scales injected.

We also hope to see broader GPU ecosystem support for the optimized BitNet GPU kernels, through serving frameworks such as vLLM or SGLang.
