Towards Bridging the Gap between Large-Scale Pretraining and Efficient Finetuning for Humanoid Control
Abstract
Off-policy Soft Actor-Critic with large-batch updates enables efficient humanoid locomotion policy pretraining, while model-based methods facilitate safe adaptation through deterministic data collection and stochastic exploration within physics-informed world models.
Reinforcement learning (RL) is widely used for humanoid control, with on-policy methods such as Proximal Policy Optimization (PPO) enabling robust training via large-scale parallel simulation and, in some cases, zero-shot deployment to real robots. However, the low sample efficiency of on-policy algorithms limits safe adaptation to new environments. Although off-policy RL and model-based RL have shown improved sample efficiency, a gap remains between large-scale pretraining and efficient finetuning on humanoids. In this paper, we find that off-policy Soft Actor-Critic (SAC), with large-batch updates and a high Update-To-Data (UTD) ratio, reliably supports large-scale pretraining of humanoid locomotion policies and achieves zero-shot deployment on real robots. For adaptation, we demonstrate that these SAC-pretrained policies can be finetuned in new environments and on out-of-distribution tasks using model-based methods. Data collection in the new environment executes a deterministic policy, while stochastic exploration is confined to a physics-informed world model. This separation mitigates the risks of random exploration during adaptation while preserving the exploratory coverage needed for improvement. Overall, the approach couples the wall-clock efficiency of large-scale simulation during pretraining with the sample efficiency of model-based learning during finetuning.
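A minimal sketch of the pretraining recipe described above, assuming hypothetical `sim`, `agent`, and `buffer` interfaces rather than the paper's released code: many parallel simulated environments feed a replay buffer, and each round of data collection is followed by several large-batch SAC updates (a high UTD ratio).

```python
# Minimal sketch, not the paper's implementation. `sim`, `agent`, and `buffer`
# are hypothetical objects exposing the interfaces used below.

def pretrain_sac(sim, agent, buffer,
                 num_envs=4096,          # large-scale parallel simulation
                 batch_size=65536,       # large-batch updates
                 utd_ratio=4,            # gradient updates per round of new data
                 total_env_steps=1_000_000_000):
    obs = sim.reset()                                    # shape: (num_envs, obs_dim)
    for _ in range(total_env_steps // num_envs):
        act = agent.sample_action(obs)                   # stochastic policy in simulation
        next_obs, rew, done = sim.step(act)
        buffer.add_batch(obs, act, rew, next_obs, done)  # num_envs transitions per step
        obs = next_obs

        # High Update-To-Data ratio: several SAC updates per batch of collected data.
        for _ in range(utd_ratio):
            batch = buffer.sample(batch_size)
            agent.update_critics(batch)                  # soft Bellman backup with entropy bonus
            agent.update_actor_and_temperature(batch)    # reparameterized policy + auto temperature
            agent.update_targets()                       # Polyak-averaged target critics
    return agent
```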
Community
Real-World Reinforcement Learning on a Humanoid Robot
Project: https://lift-humanoid.github.io/
Code: https://github.com/bigai-ai/LIFT-humanoid
Humanoids can dance and backflip, but they are still "frozen" in time.
Current Sim2Real reinforcement learning (RL) relies on massive domain randomization: train in the lab, deploy, and pray. But the moment friction changes or hardware wears down, a star athlete becomes a paperweight.
Why is real-world RL so hard?
1. Safety: trial & error = broken hardware.
2. Efficiency: real-world data is slow and expensive.
At ICLR 2026, we present LIFT:
- Pretrain the policy in simulation with off-policy RL (SAC).
- Learn a physics-informed world model from pretraining data.
- Real-world finetuning: collect data with a deterministic policy, while pushing stochastic exploration into the world model under constraints, reducing hardware risk and improving sample efficiency (see the sketch below).
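A minimal sketch of this finetuning loop under assumed interfaces (`robot`, `agent`, `world_model`, and the two buffers are hypothetical placeholders, not the released LIFT code): the hardware only ever executes the deterministic policy, while stochastic, exploratory actions are sampled exclusively inside the learned world model, Dyna-style.

```python
# Minimal sketch with hypothetical interfaces, not the released LIFT code.
# The real robot executes only deterministic actions; stochastic exploration
# is confined to short rollouts inside the physics-informed world model.

def finetune(robot, agent, world_model, real_buffer, model_buffer,
             num_rounds=20,
             real_steps_per_round=1000,      # slow, expensive real-world data
             imagined_rollouts=1000,         # cheap synthetic rollouts per round
             rollout_horizon=16,
             batch_size=256):
    for _ in range(num_rounds):
        # 1) Safe data collection: deterministic (mean) actions on hardware, no noise.
        obs = robot.reset()
        for _ in range(real_steps_per_round):
            act = agent.deterministic_action(obs)
            next_obs, rew, done = robot.step(act)
            real_buffer.add(obs, act, rew, next_obs, done)
            obs = robot.reset() if done else next_obs

        # 2) Refit the physics-informed world model on the newly collected transitions.
        world_model.fit(real_buffer)

        # 3) Stochastic exploration in imagination only: branch short rollouts from
        #    real states, store them, and run off-policy SAC updates on that data.
        for _ in range(imagined_rollouts):
            obs_b = real_buffer.sample_states(batch_size)
            for _ in range(rollout_horizon):
                act_b = agent.sample_action(obs_b)               # exploratory, stochastic
                next_b, rew_b, done_b = world_model.step(obs_b, act_b)
                model_buffer.add_batch(obs_b, act_b, rew_b, next_b, done_b)
                obs_b = next_b
            agent.update(model_buffer.sample(batch_size))        # policy improvement
    return agent
```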