# BitVLA-bf16
BitVLA-bf16 is an open, highly efficient 1-bit vision-language-action (VLA) model trained on ~1M robot manipulation episodes from the Open X-Embodiment dataset. Across simulation benchmarks and real-world tasks, BitVLA matches the performance of the full-precision OpenVLA-OFT baseline while reducing model memory by 11.0× and end-to-end latency by 4.4×. All BitVLA checkpoints and the training codebase are released under the MIT License. For full details, please read our paper and see our GitHub repository.
Note that this repository provides the bf16 master weights of BitVLA; quantization is performed online at inference time. To realize the actual memory savings, you may quantize the weights offline to 1.58-bit precision. We recommend the bitnet.cpp inference framework to accurately measure the reduction in inference cost. A dedicated inference framework and offline-quantized model are coming soon.
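To illustrate what online 1.58-bit quantization means, here is a minimal sketch of absmean ternary quantization in the style of BitNet b1.58, where each weight is mapped to {-1, 0, +1} with a per-tensor scale. This is an illustrative example only (using numpy for self-containment); BitVLA's actual quantization kernels may differ in granularity and implementation details.

```python
import numpy as np

def absmean_ternary_quantize(w, eps=1e-5):
    """BitNet b1.58-style absmean quantization (illustrative sketch).

    Scales the weight matrix by the mean of its absolute values,
    then rounds each entry to the nearest value in {-1, 0, +1}.
    Dequantized weights are approximated as w_q * scale.
    """
    scale = max(float(np.abs(w).mean()), eps)   # per-tensor absmean scale
    w_q = np.clip(np.round(w / scale), -1, 1)   # ternary (1.58-bit) weights
    return w_q, scale

# Example: large-magnitude weights map to ±1, small ones collapse to 0.
w = np.array([[0.8, -0.1],
              [-0.9, 0.05]])
w_q, scale = absmean_ternary_quantize(w)
# w_q == [[1, 0], [-1, 0]], scale == 0.4625
```

Storing the ternary `w_q` (about 1.58 bits per weight) plus one scale per tensor is what yields the memory reduction reported above, compared with 16 bits per weight for a bf16 checkpoint.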
| Models | Size | Memory Usage↓ | LIBERO-Spatial | LIBERO-Object | LIBERO-Goal | LIBERO-Long | Avg. |
|---|---|---|---|---|---|---|---|
| **Large Models** | | | | | | | |
| OpenVLA | 7.5B | 15.1GB (10.79×) | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| CoT-VLA | 8.0B | 16.2GB (11.57×) | 87.5 | 91.6 | 87.6 | 69.0 | 81.1 |
| UniVLA | 8.5B | 17.0GB (12.14×) | 96.5 | 96.8 | 95.6 | 92.0 | 95.2 |
| UnifiedVLA | 8.5B | 17.0GB (12.14×) | 95.4 | 98.8 | 93.6 | 94.0 | 95.5 |
| OpenVLA-OFT | 7.7B | 15.4GB (11.00×) | 97.6 | 98.4 | 97.9 | 94.5 | 97.1 |
| **Small Models** | | | | | | | |
| SpatialVLA | 4.2B | 8.5GB (6.07×) | 88.2 | 89.9 | 78.6 | 55.5 | 78.1 |
| NORA-Long | 3.8B | 7.5GB (5.36×) | 92.2 | 95.4 | 89.4 | 74.6 | 87.9 |
| 4D-VLA | 4.1B | 8.3GB (5.93×) | 88.9 | 95.2 | 90.9 | 79.1 | 88.6 |
| SmolVLA | 2.3B | 4.6GB (3.29×) | 93.0 | 94.0 | 91.0 | 77.0 | 88.8 |
| GROOT-N1 | 2.2B | 4.4GB (3.14×) | 94.4 | 97.6 | 93.0 | 90.6 | 93.9 |
| π₀ | 3.5B | 7.0GB (5.00×) | 96.8 | 98.8 | 95.8 | 85.2 | 94.2 |
| BitVLA w/o pre-training | 3.0B | 1.4GB (1.00×) | 97.4 | 99.6 | 94.4 | 87.6 | 94.8 |
| 🚀BitVLA | 3.0B | 1.4GB (1.00×) | 96.6 | 99.0 | 95.4 | 92.8 | 96.0 |
## Pre-training Details
- Base model: We use hongyuw/bitvla-bitsiglipL-224px-bf16 as the base model.
- Dataset: Following OpenVLA, we use a curated large-scale corpus based on a subset of the Open X-Embodiment dataset, resulting in ~1M training samples.
- Hyperparameters: We train the model for 200K steps with a total batch size of 2048. The peak learning rates are set to 3×10⁻⁴ for the LLM and 1×10⁻⁴ for the ViT.
- Compute: The full pre-training takes approximately 14 days on 16 NVIDIA H800 (80GB) GPUs.
## Citation
If you find this repository useful, please consider citing our work:
```bibtex
@article{bitvla,
  title={BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation},
  author={Hongyu Wang and Chuyan Xiong and Ruiping Wang and Xilin Chen},
  year={2025},
  eprint={2506.07530},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
}
```