arxiv:2512.01801

GR-RL: Going Dexterous and Precise for Long-Horizon Robotic Manipulation

Published on Dec 1, 2025
· Submitted by Xiao Ma on Dec 2, 2025
Authors: Zhi Su et al.

Abstract

GR-RL enhances a vision-language-action policy for long-horizon dexterous manipulation through a multi-stage training pipeline that filters, augments, and refines demonstrations using reinforcement learning.

AI-generated summary

We present GR-RL, a robotic learning framework that turns a generalist vision-language-action (VLA) policy into a highly capable specialist for long-horizon dexterous manipulation. Existing VLA policies rest on the assumption that human demonstrations are optimal. However, we argue that in highly dexterous, high-precision manipulation tasks, human demonstrations are noisy and suboptimal. GR-RL therefore proposes a multi-stage training pipeline that filters, augments, and refines the demonstrations with reinforcement learning. First, GR-RL learns a vision-language-conditioned task progress function, filters the demonstration trajectories, and keeps only the transitions that contribute positively to progress. Specifically, we show that by directly applying offline RL with a sparse reward, the resulting Q-values can be treated as a robust progress function. Next, we introduce morphological symmetry augmentation, which greatly improves the generalization and performance of GR-RL. Lastly, to better align the VLA policy with its deployment behaviors for high-precision control, we perform online RL by learning a latent-space noise predictor. With this pipeline, GR-RL is, to our knowledge, the first learning-based policy that can autonomously lace up a shoe by threading shoelaces through multiple eyelets with an 83.3% success rate, a task requiring long-horizon reasoning, millimeter-level precision, and compliant soft-body interaction. We hope GR-RL provides a step toward enabling generalist robot foundation models to specialize into reliable real-world experts.
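To make the first stage more concrete, here is a minimal sketch of how a sparse-reward offline critic can double as a progress filter. Everything below (network shapes, hyperparameters, and the filtering rule) is an illustrative assumption, not the paper's released code.

```python
# Minimal sketch (illustrative only, not the paper's code): a sparse-reward
# offline critic whose Q-values serve as a task-progress signal for filtering
# demonstration transitions. Shapes and hyperparameters are assumptions.
import torch
import torch.nn as nn


class ProgressCritic(nn.Module):
    """Q(s, a) over a fused observation/language embedding. With reward 1 only
    at task success and 0 elsewhere, Q approximates gamma^(steps to success),
    i.e. a monotone progress score in [0, 1]."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def td_target(critic, reward, next_obs, next_act, done, gamma=0.99):
    """One-step TD target for the sparse 0/1 reward."""
    with torch.no_grad():
        return reward + gamma * (1.0 - done) * critic(next_obs, next_act)


def keep_transition(critic, obs, act, next_obs, next_act, margin=0.0):
    """Filter rule: keep a transition only if estimated progress increases."""
    with torch.no_grad():
        delta = critic(next_obs, next_act) - critic(obs, act)
    return delta > margin
```

The point the sketch captures is that with a 0/1 success reward, Q(s, a) decays with the discounted distance to success, so a transition whose Q-value rises is one that contributes positively to progress; the paper's actual architecture, chunked action inputs, and filtering rule may differ.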

Community

Our VLA-RL model that learns to lace up your shoes :)



Thank you for the hard work! As an undergraduate student interested in this topic, I tried to reproduce the progress-critic method and encountered a few problems.

Ratio of failure data to success data
When training the progress critic offline, what is the approximate ratio between failure trajectories and successful trajectories in the dataset? Is it necessary to deliberately balance them, or is it preferable to use the original data distribution?

Design of the discount factor γ (gamma) for critic updates
What range of γ is typically used for the progress critic during offline TD or distributional updates? Is γ adjusted based on task horizon or the specific definition of progress?

Baseline penalty for failure data
For transitions in failure trajectories, is any additional baseline penalty introduced in the reward or progress target, or does the method rely solely on the natural decrease of progress to represent failure signals?

Actor updates during offline progress critic training
During the offline training phase of the progress critic, is the actor completely frozen and used only as a behavior policy for data collection, or is there any form of joint or alternating update with the critic? Or is the training data entirely pre-collected?

How to keep ρ constrained within the [0, 1] range
Specifically, how is ρ kept within the [0, 1] interval when the reward is not limited to a single terminal frame (i.e., not just reward = 1 at the final step and 0 elsewhere), but multiple terminal or near-terminal steps carry non-zero rewards, so that Q-values can exceed 1?
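To make the premise of this question concrete, here is a small worked case under standard discounted returns (my own illustration of the concern, not the paper's formulation):

```latex
% Sparse case: reward 1 only at the terminal step T, 0 elsewhere.
\[
  Q(s_t, a_t) \;=\; \mathbb{E}\!\left[\sum_{k=0}^{T-t} \gamma^{k}\, r_{t+k}\right]
             \;=\; \gamma^{\,T-t} \;\in\; [0, 1].
\]
% Dense-terminal case: the last m steps each carry reward 1.
\[
  Q(s_t, a_t) \;=\; \sum_{j=0}^{m-1} \gamma^{\,(T-j)-t} \;>\; 1
  \quad \text{for } \gamma \text{ close to } 1 \text{ and } m \ge 2,
\]
% so some normalization, e.g. \rho = Q / Q_{\max} with Q_{\max} the maximum
% achievable discounted return, would seem necessary to keep \rho in [0, 1].
```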

How does Q-chunking help with this, and could the progress critic also be obtained with a chunk size of 1?
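For readers unfamiliar with the term, here is a generic sketch of what a chunked critic looks like and how chunk size 1 reduces to an ordinary per-step critic (my own illustration, not the paper's implementation):

```python
# Generic chunked critic (illustrative assumption, not the paper's code):
# the critic scores a short chunk of H future actions instead of a single
# action; chunk = 1 recovers the ordinary per-step Q(s, a_t).
import torch
import torch.nn as nn


class ChunkedQ(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, chunk: int = 8, hidden: int = 256):
        super().__init__()
        self.chunk = chunk
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim * chunk, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act_chunk: torch.Tensor) -> torch.Tensor:
        # act_chunk: (batch, chunk, act_dim), flattened before concatenation.
        flat = act_chunk.reshape(act_chunk.shape[0], -1)
        return self.net(torch.cat([obs, flat], dim=-1)).squeeze(-1)
```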

Thank you again for your patience and your work.
