SA-VLA: Spatially-Aware Reinforcement Learning for Flow-Matching VLA Models

SA-VLA is a spatially-aware reinforcement learning approach for flow-matching Vision-Language-Action (VLA) models.
It is developed on top of the RLinf framework and targets robust embodied manipulation with stronger spatial generalization.


Model Summary

SA-VLA fuses visual tokens and spatial tokens into geometry-aware embeddings, then optimizes the policy via:

  1. Step-level dense rewards
  2. Spatially-conditioned exploration (SCAN)
  3. RL fine-tuning on embodied benchmarks
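As a rough illustration of item 1, a step-level dense reward can combine a per-step progress delta, a small time penalty, and a terminal success bonus. This is a minimal sketch with made-up names and coefficients, not the exact reward used in SA-VLA:

```python
def step_reward(prev_progress: float,
                curr_progress: float,
                success: bool,
                step_penalty: float = 0.01,
                success_bonus: float = 1.0) -> float:
    """Illustrative step-level dense reward (hypothetical shaping, not the
    paper's formulation): reward progress toward the goal at every step,
    charge a small per-step cost, and add a bonus on task success."""
    return (curr_progress - prev_progress) - step_penalty + (
        success_bonus if success else 0.0
    )
```

Compared with a sparse success-only signal, this kind of shaping gives the policy gradient a non-zero learning signal at every timestep.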

This repository provides model weights used in SA-VLA experiments.


Intended Use

  • RL fine-tuning and evaluation for embodied manipulation tasks
  • Experiments on LIBERO / LIBERO-PLUS style benchmarks
  • Research on spatial reasoning in VLA post-training

For complete environment setup, training scripts, and benchmark integration, use the full code repository: https://github.com/TwSphinx54/SA-VLA


Quick Start (with SA-VLA codebase)

1) Clone the project

git clone https://github.com/TwSphinx54/SA-VLA.git
cd SA-VLA

2) Setup environment

Follow the RLinf setup in:

  • README.RLinf.md (framework/environment)
  • scripts/setup_container.sh (extra container setup)

3) Place weights

Put downloaded checkpoints under:

weights/

4) Run training / evaluation

# RL training
bash examples/embodiment/run_embodiment.sh libero_spatial_ppo_openpi_pi05

# Evaluation
bash examples/embodiment/eval_embodiment.sh libero_spatial_ppo_openpi_pi05_eval

Recommended Weight Layout

weights
|-- Pi05-LIBERO
|-- Pi05-VGGT-LIBERO-FUSER-SFT_BF16
`-- RLinf-Pi05-SFT
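Since the training scripts expect these checkpoints to be in place, a quick sanity check before launching can catch a missing directory early. This is a hypothetical helper, not part of the SA-VLA codebase:

```python
from pathlib import Path

# Checkpoint directories from the recommended layout above.
EXPECTED_SUBDIRS = (
    "Pi05-LIBERO",
    "Pi05-VGGT-LIBERO-FUSER-SFT_BF16",
    "RLinf-Pi05-SFT",
)

def missing_checkpoints(weights_dir: str = "weights") -> list[str]:
    """Return the expected checkpoint subdirectories absent under weights_dir."""
    root = Path(weights_dir)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]
```

An empty return value means the layout matches; any listed names still need to be downloaded into `weights/`.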

Dataset Notes

The SA-VLA experiments rely on LIBERO-family datasets and benchmark configurations.
To switch between subsets and the full set, modify the benchmark mapping in your OpenPi LIBERO installation as documented in the main repository.


Limitations

  • Requires non-trivial robotics simulation setup
  • Performance depends on environment/version consistency
  • Not intended for safety-critical real-world deployment without additional validation

Citation

@misc{pan2026savlaspatiallyawareflowmatchingvisionlanguageaction,
  title={SA-VLA: Spatially-Aware Flow-Matching for Vision-Language-Action Reinforcement Learning},
  author={Xu Pan and Zhenglin Wan and Xingrui Yu and Xianwei Zheng and Youkai Ke and Ming Sun and Rui Wang and Ziwei Wang and Ivor Tsang},
  year={2026},
  eprint={2602.00743},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2602.00743}
}

License

Apache-2.0


Acknowledgments

Built upon the RLinf framework and OpenPi.
