# SA-VLA: Spatially-Aware Reinforcement Learning for Flow-Matching VLA Models
SA-VLA is a spatially-aware reinforcement learning approach for flow-matching Vision-Language-Action (VLA) models.
It is developed on top of the RLinf framework and targets robust embodied manipulation with stronger spatial generalization.
- 📄 Paper: https://arxiv.org/abs/2602.00743
- 🌐 Project Page: https://xupan.top/Projects/savla
- 🧩 Codebase: https://github.com/TwSphinx54/SA-VLA
- 🏗️ RL Framework: https://github.com/RLinf/RLinf
## Model Summary
SA-VLA fuses visual tokens and spatial tokens into geometry-aware embeddings, then optimizes the policy via:
- Step-level dense rewards
- Spatially-conditioned exploration (SCAN)
- RL fine-tuning on embodied benchmarks
This repository provides model weights used in SA-VLA experiments.
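The token-fusion idea above can be sketched in a few lines. This is a minimal, illustrative example only, assuming visual and spatial tokens are pre-extracted feature arrays and using a hypothetical learned projection; the actual SA-VLA fusion module lives in the code repository:

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_tokens(visual, spatial, w_proj, b_proj):
    """Concatenate per-token visual and spatial features, then project
    them into a shared geometry-aware embedding space (illustrative only)."""
    assert visual.shape[0] == spatial.shape[0], "one spatial token per visual token"
    fused = np.concatenate([visual, spatial], axis=-1)  # (T, Dv + Ds)
    return fused @ w_proj + b_proj                      # (T, De)

# Hypothetical dimensions: 16 tokens, 64-d visual, 32-d spatial, 128-d embedding
T, Dv, Ds, De = 16, 64, 32, 128
visual  = rng.normal(size=(T, Dv))
spatial = rng.normal(size=(T, Ds))
w_proj  = rng.normal(size=(Dv + Ds, De)) * 0.02
b_proj  = np.zeros(De)

emb = fuse_tokens(visual, spatial, w_proj, b_proj)
print(emb.shape)  # (16, 128)
```

In the real model the projection is learned end-to-end and the resulting geometry-aware embeddings feed the flow-matching policy; the sketch only shows the fusion shape contract.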
## Intended Use
- RL fine-tuning and evaluation for embodied manipulation tasks
- Experiments on LIBERO / LIBERO-PLUS style benchmarks
- Research on spatial reasoning in VLA post-training
For complete environment setup, training scripts, and benchmark integration, use the full code repository: https://github.com/TwSphinx54/SA-VLA
## Quick Start (with SA-VLA codebase)

1) Clone the project

```bash
git clone https://github.com/TwSphinx54/SA-VLA.git
cd SA-VLA
```
2) Set up the environment

Follow the RLinf setup in:

- `README.RLinf.md` (framework/environment setup)
- `scripts/setup_container.sh` (extra container setup)
3) Place the weights

Put the downloaded checkpoints under `weights/` (see the recommended layout below).
4) Run training / evaluation

```bash
# RL training
bash examples/embodiment/run_embodiment.sh libero_spatial_ppo_openpi_pi05

# Evaluation
bash examples/embodiment/eval_embodiment.sh libero_spatial_ppo_openpi_pi05_eval
```
## Recommended Weight Layout

```text
weights
|-- Pi05-LIBERO
|-- Pi05-VGGT-LIBERO-FUSER-SFT_BF16
`-- RLinf-Pi05-SFT
```
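Before launching training, it can help to check that the expected checkpoint directories are in place. The helper below is a small sketch (not part of the repository), assuming the layout shown above:

```python
from pathlib import Path

# Checkpoint directory names from the recommended layout above
EXPECTED = [
    "Pi05-LIBERO",
    "Pi05-VGGT-LIBERO-FUSER-SFT_BF16",
    "RLinf-Pi05-SFT",
]

def missing_checkpoints(weights_dir="weights"):
    """Return the names of expected checkpoint directories that are absent."""
    root = Path(weights_dir)
    return [name for name in EXPECTED if not (root / name).is_dir()]

if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoints:", ", ".join(missing))
    else:
        print("All checkpoints found.")
```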
## Dataset Notes

The SA-VLA experiments rely on LIBERO-family data and benchmark configurations.
To switch between subset and full-set evaluation, modify the benchmark mapping in your OpenPi LIBERO installation, as documented in the main repository.
## Limitations
- Requires non-trivial robotics simulation setup
- Performance depends on environment/version consistency
- Not intended for safety-critical real-world deployment without additional validation
## Citation

```bibtex
@misc{pan2026savlaspatiallyawareflowmatchingvisionlanguageaction,
  title={SA-VLA: Spatially-Aware Flow-Matching for Vision-Language-Action Reinforcement Learning},
  author={Xu Pan and Zhenglin Wan and Xingrui Yu and Xianwei Zheng and Youkai Ke and Ming Sun and Rui Wang and Ziwei Wang and Ivor Tsang},
  year={2026},
  eprint={2602.00743},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2602.00743}
}
```
## License
Apache-2.0
## Acknowledgments
Built upon: