Towards Bridging the Gap between Large-Scale Pretraining and Efficient Finetuning for Humanoid Control
Abstract
Off-policy Soft Actor-Critic with large-batch updates enables efficient humanoid locomotion policy pretraining, while model-based methods facilitate safe adaptation through deterministic data collection and stochastic exploration within physics-informed world models.
Reinforcement learning (RL) is widely used for humanoid control, with on-policy methods such as Proximal Policy Optimization (PPO) enabling robust training via large-scale parallel simulation and, in some cases, zero-shot deployment to real robots. However, the low sample efficiency of on-policy algorithms limits safe adaptation to new environments. Although off-policy RL and model-based RL have shown improved sample efficiency, a gap remains between large-scale pretraining and efficient finetuning on humanoids. In this paper, we find that off-policy Soft Actor-Critic (SAC), with large-batch updates and a high Update-To-Data (UTD) ratio, reliably supports large-scale pretraining of humanoid locomotion policies and achieves zero-shot deployment on real robots. For adaptation, we demonstrate that these SAC-pretrained policies can be finetuned in new environments and on out-of-distribution tasks using model-based methods. Data collection in the new environment executes a deterministic policy, while stochastic exploration is confined to a physics-informed world model. This separation mitigates the risks of random exploration during adaptation while preserving the exploratory coverage needed for improvement. Overall, the approach couples the wall-clock efficiency of large-scale simulation during pretraining with the sample efficiency of model-based learning during finetuning.
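A minimal sketch of the pretraining recipe described above, assuming hypothetical `sim`, `agent`, and `buffer` interfaces rather than the paper's released code: many parallel simulated environments feed a replay buffer, and each round of data collection is followed by several large-batch SAC updates (a high UTD ratio).

```python
# Minimal sketch, not the paper's implementation. `sim`, `agent`, and `buffer`
# are hypothetical objects exposing the interfaces used below.

def pretrain_sac(sim, agent, buffer,
                 num_envs=4096,          # large-scale parallel simulation
                 batch_size=65536,       # large-batch updates
                 utd_ratio=4,            # gradient updates per round of new data
                 total_env_steps=1_000_000_000):
    obs = sim.reset()                                    # shape: (num_envs, obs_dim)
    for _ in range(total_env_steps // num_envs):
        act = agent.sample_action(obs)                   # stochastic policy in simulation
        next_obs, rew, done = sim.step(act)
        buffer.add_batch(obs, act, rew, next_obs, done)  # num_envs transitions per step
        obs = next_obs

        # High Update-To-Data ratio: several SAC updates per batch of collected data.
        for _ in range(utd_ratio):
            batch = buffer.sample(batch_size)
            agent.update_critics(batch)                  # soft Bellman backup with entropy bonus
            agent.update_actor_and_temperature(batch)    # reparameterized policy + auto temperature
            agent.update_targets()                       # Polyak-averaged target critics
    return agent
```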
Community
Real-World Reinforcement Learning on a Humanoid Robot
Project: https://lift-humanoid.github.io/
Code: https://github.com/bigai-ai/LIFT-humanoid
Humanoids can dance and backflip, but they are still "frozen" in time.
Current Sim2Real reinforcement learning (RL) relies on massive domain randomization: train in the lab, deploy, and pray. But the moment friction changes or hardware wears down, a star athlete becomes a paperweight.
Why is real-world RL so hard?
1. Safety: trial & error = broken hardware.
2. Efficiency: real-world data is slow and expensive.
At ICLR 2026, we present LIFT:
- Pretrain the policy in simulation with off-policy RL (SAC).
- Learn a physics-informed world model from pretraining data.
- Real-world finetuning: collect data with a deterministic policy, while pushing stochastic exploration into the world model under constraints, reducing hardware risk and improving sample efficiency (see the sketch below).
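A minimal sketch of this finetuning loop under assumed interfaces (`robot`, `agent`, `world_model`, and the two buffers are hypothetical placeholders, not the released LIFT code): the hardware only ever executes the deterministic policy, while stochastic, exploratory actions are sampled exclusively inside the learned world model, Dyna-style.

```python
# Minimal sketch with hypothetical interfaces, not the released LIFT code.
# The real robot executes only deterministic actions; stochastic exploration
# is confined to short rollouts inside the physics-informed world model.

def finetune(robot, agent, world_model, real_buffer, model_buffer,
             num_rounds=20,
             real_steps_per_round=1000,      # slow, expensive real-world data
             imagined_rollouts=1000,         # cheap synthetic rollouts per round
             rollout_horizon=16,
             batch_size=256):
    for _ in range(num_rounds):
        # 1) Safe data collection: deterministic (mean) actions on hardware, no noise.
        obs = robot.reset()
        for _ in range(real_steps_per_round):
            act = agent.deterministic_action(obs)
            next_obs, rew, done = robot.step(act)
            real_buffer.add(obs, act, rew, next_obs, done)
            obs = robot.reset() if done else next_obs

        # 2) Refit the physics-informed world model on the newly collected transitions.
        world_model.fit(real_buffer)

        # 3) Stochastic exploration in imagination only: branch short rollouts from
        #    real states, store them, and run off-policy SAC updates on that data.
        for _ in range(imagined_rollouts):
            obs_b = real_buffer.sample_states(batch_size)
            for _ in range(rollout_horizon):
                act_b = agent.sample_action(obs_b)               # exploratory, stochastic
                next_b, rew_b, done_b = world_model.step(obs_b, act_b)
                model_buffer.add_batch(obs_b, act_b, rew_b, next_b, done_b)
                obs_b = next_b
            agent.update(model_buffer.sample(batch_size))        # policy improvement
    return agent
```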