EnsembleVLA: Released Checkpoints

Released checkpoints for EnsembleVLA: Ensemble Learning for Vision-Language Action Models.

EnsembleVLA is an energy-based framework for principled composition of diverse Vision-Language-Action (VLA) policies. It formulates diffusion-based and flow-based VLA models under a unified energy perspective, where additive energy aggregation induces policy composition at the distribution level. The pretrained base policies stay frozen; a lightweight ensemble head with learnable composition weights and confidence-aware gating aggregates them into a stronger policy. The composed policy is evaluated on the RoboTwin2 rollout interface.
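
As a rough illustration of what additive energy aggregation means at the distribution level, here is a toy sketch. It is not the released ensemble head: it assumes two one-dimensional Gaussian "policies" with quadratic energies, fixed hypothetical weights instead of learned ones, and no confidence-aware gating. All names and values are made up for the example.

```python
def gaussian_energy(action, mean, std):
    """Quadratic energy of a 1-D Gaussian over actions (up to a constant)."""
    return 0.5 * ((action - mean) / std) ** 2


def composed_energy(action, means, stds, weights):
    """Additive aggregation: E(a) = sum_i w_i * E_i(a)."""
    return sum(
        w * gaussian_energy(action, m, s)
        for w, m, s in zip(weights, means, stds)
    )


# Hypothetical toy values, not taken from the paper or the checkpoints.
means = [0.0, 1.0]
stds = [1.0, 0.5]
weights = [0.5, 0.5]

# Brute-force search for the lowest-energy action on a fine grid.
actions = [-2.0 + i * 0.001 for i in range(5001)]
best_action = min(actions, key=lambda a: composed_energy(a, means, stds, weights))

# For Gaussians the minimum has a closed form: a precision-weighted mean.
precisions = [w / s ** 2 for w, s in zip(weights, stds)]
closed_form = sum(p * m for p, m in zip(precisions, means)) / sum(precisions)

print(best_action, closed_form)
```

Summing energies corresponds to multiplying the (unnormalized) densities, so in this toy case the composed optimum is pulled toward the more confident, lower-variance policy.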

What's in this repository

This repository contains two composition families, each covering 8 RoboTwin2 tasks. For every task we release the lightweight ensemble head plus the two frozen base policies:

| Family   | Base policy 1         | Base policy 2             |
|----------|-----------------------|---------------------------|
| dp+dp3   | Diffusion Policy (DP) | 3D Diffusion Policy (DP3) |
| dp+pi0.5 | Diffusion Policy (DP) | pi0.5 / openpi            |

Tasks: beat_block_hammer, click_alarmclock, dump_bin_bigbin, handover_block, move_playingcard_away, open_laptop, place_bread_skillet, stack_bowls_three.

Repository layout

Files live at the repository root and mirror the code's best_checkpoint/ layout:

dp+dp3/<task>/
  ensemble_checkpoint/best.pt          # lightweight EnsembleVLA head
  base_dp/<ckpt>.ckpt                  # frozen DP base policy
  base_dp3/<ckpt>.ckpt                 # frozen DP3 base policy
dp+pi0.5/<task>/
  ensemble_checkpoint/best.pt          # lightweight EnsembleVLA head
  base_dp/<ckpt>.ckpt                  # frozen DP base policy
  base_pi05_checkpoint_dir/
    model.safetensors                  # frozen pi0.5 base policy weights
    metadata.pt
    assets/<task>/norm_stats.json

The pi0.5 base needs all three of model.safetensors, metadata.pt, and assets/ from the same base_pi05_checkpoint_dir/.
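
A quick sanity check for that requirement could look like the sketch below. The helper name `missing_pi05_entries` is hypothetical and not part of the EnsembleVLA code; it only tests that the three entries listed above exist, not that their contents are valid.

```python
from pathlib import Path

# The three entries the pi0.5 base policy needs, per the layout above.
REQUIRED_PI05_ENTRIES = ("model.safetensors", "metadata.pt", "assets")


def missing_pi05_entries(checkpoint_dir):
    """Return the required entries absent from a base_pi05_checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in REQUIRED_PI05_ENTRIES if not (root / name).exists()]
```

For example, `missing_pi05_entries("best_checkpoint/dp+pi0.5/open_laptop/base_pi05_checkpoint_dir")` returns an empty list when all three entries are present.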

Download

Download everything straight into the code repo's best_checkpoint/ directory:

pip install -U huggingface_hub
huggingface-cli download mingchens/EnsembleVLA --repo-type model --local-dir best_checkpoint

Then follow the Environment Setup and Evaluation instructions in the GitHub README. Only inference checkpoints are required for evaluation; optimizer states and training/rollout logs are not included. The full checkpoint manifest (per-task base checkpoints and results) is in docs/checkpoints.md.
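
If only one family/task pair is needed, the download can probably be narrowed from Python with `snapshot_download` and its `allow_patterns` filter. This is a sketch, assuming the hosted files follow the layout shown in the Repository layout section; the helper names `allow_patterns_for` and `download_task` are hypothetical, and the filter has not been checked against the hosted repository.

```python
def allow_patterns_for(family, task):
    """Glob pattern selecting every file under <family>/<task>/."""
    return [f"{family}/{task}/*"]


def download_task(family, task, local_dir="best_checkpoint"):
    """Download one family/task subtree into local_dir (needs network access)."""
    # Imported lazily so the helper above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="mingchens/EnsembleVLA",
        repo_type="model",
        local_dir=local_dir,
        allow_patterns=allow_patterns_for(family, task),
    )
```

Calling `download_task("dp+pi0.5", "open_laptop")` would then fetch only that task's ensemble head and its two base policies.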

License

Released under the MIT License.

Citation

@inproceedings{song2026ensemblevla,
  title={EnsembleVLA: Ensemble Learning for Vision-Language Action Models},
  author={Song, Mingchen and Deng, Xiang and Wei, Jie and Jiang, Dongmei and Nie, Liqiang and Guan, Weili},
  booktitle={International Conference on Machine Learning},
  year={2026}
}