PersistWorld

Project Page | GitHub | Paper (arXiv)

Authors: Jai Bardhan, Patrik Drozdík, Josef Šivic, Vladimír Petrík
Affiliation: Czech Institute of Informatics, Robotics and Cybernetics (CIIRC), Czech Technical University in Prague



Overview

PersistWorld stabilizes long-horizon robot world model rollouts via RL post-training on autoregressive outputs.

Action-conditioned robot world models generate future video frames from a robot action sequence, but they break down during long-horizon autoregressive deployment. Each predicted clip is fed back as context for the next, so errors compound and visual quality degrades rapidly, a phenomenon known as the closed-loop gap. We present PersistWorld, an RL post-training framework that trains the world model directly on its own autoregressive rollouts, using contrastive denoising with multi-view perceptual rewards.
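The closed-loop gap can be illustrated with a toy example. The sketch below is not the PersistWorld model: it replaces video clips with a single scalar state, and the names `true_step`, `biased_model`, and the 0.1 per-step bias are hypothetical choices made only to show how a small one-step error stays constant when the context is ground truth but accumulates when predictions are fed back.

```python
from typing import Callable, List, Sequence

Predictor = Callable[[float, float], float]


def rollout_closed_loop(predict: Predictor, start: float,
                        actions: Sequence[float]) -> List[float]:
    """Feed each prediction back as the context for the next step."""
    context, outputs = start, []
    for action in actions:
        context = predict(context, action)  # prediction becomes next context
        outputs.append(context)
    return outputs


def rollout_ground_truth_context(predict: Predictor,
                                 truth_contexts: Sequence[float],
                                 actions: Sequence[float]) -> List[float]:
    """Predict every step from the true previous state."""
    return [predict(c, a) for c, a in zip(truth_contexts, actions)]


def true_step(state: float, action: float) -> float:
    # Hypothetical ground-truth dynamics.
    return state + action


def biased_model(state: float, action: float) -> float:
    # Hypothetical learned model with a small constant one-step error.
    return state + action + 0.1


actions = [1.0] * 14  # 14 autoregressive steps, as in the evaluation horizon
truth = [0.0]
for a in actions:
    truth.append(true_step(truth[-1], a))

closed = rollout_closed_loop(biased_model, truth[0], actions)
open_ = rollout_ground_truth_context(biased_model, truth[:-1], actions)

closed_err = [abs(p - t) for p, t in zip(closed, truth[1:])]
open_err = [abs(p - t) for p, t in zip(open_, truth[1:])]

print(f"final error, ground-truth context: {open_err[-1]:.2f}")
print(f"final error, closed loop:          {closed_err[-1]:.2f}")
```

With ground-truth context the error stays at the one-step bias, while in closed loop it grows with every step. Training on the model's own rollouts targets the second setting directly.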

Method

PersistWorld addresses the closed-loop gap via online RL post-training. Rather than training on ground-truth history, we run the model autoregressively on its own outputs, branch K=16 candidate continuations from a shared rollout history, rank them with multi-view perceptual rewards (LPIPS, SSIM, PSNR), and update lightweight LoRA adapters and the action encoder using a contrastive denoising objective.
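The branch-and-rank step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the released training code: it scores candidates with PSNR only (LPIPS needs a learned network and SSIM a windowed filter), averages the score over views, and uses random arrays in place of generated frames. The function names `psnr` and `rank_candidates`, the array layout, and the view count of 3 are hypothetical.

```python
import numpy as np


def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = float(np.mean((pred - target) ** 2))
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))


def rank_candidates(candidates: np.ndarray, target: np.ndarray):
    """Score K candidates against the reference and sort them best-first.

    candidates: (K, V, H, W, C) array, K continuations over V camera views.
    target:     (V, H, W, C) ground-truth frames for the same step.
    Returns (order, rewards), where order[0] is the best candidate index.
    """
    num_views = target.shape[0]
    rewards = np.array([
        np.mean([psnr(cand[v], target[v]) for v in range(num_views)])
        for cand in candidates
    ])
    order = np.argsort(-rewards)  # descending reward
    return order, rewards


# Stand-in data: K=16 candidates that differ only in how noisy they are.
rng = np.random.default_rng(0)
K, V = 16, 3  # V=3 views is an assumption made for this sketch
target = rng.random((V, 8, 8, 3))
noise_levels = np.linspace(0.01, 0.16, K)
candidates = np.stack([
    np.clip(target + s * rng.standard_normal(target.shape), 0.0, 1.0)
    for s in noise_levels
])

order, rewards = rank_candidates(candidates, target)
print("best candidate:", int(order[0]), "reward (dB):", round(float(rewards[order[0]]), 2))
```

How the ranked candidates enter the contrastive denoising objective is not covered here; the paper and repository are the reference for that part.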

Results

PersistWorld establishes a new state of the art on the DROID dataset over 14-step autoregressive rollouts (≈11 s):

Cameras    Model                  SSIM ↑   PSNR ↑   LPIPS ↓
External   WPE                    0.77     20.33    0.131
External   IRASim                 0.77     21.36    0.117
External   Ctrl-World (paper)     0.83     23.56    0.091
External   Ctrl-World (repro)     0.84     23.02    0.081
External   PersistWorld (Ours)    0.86     24.42    0.070
Wrist      Ctrl-World (repro)     0.62     17.80    0.310
Wrist      PersistWorld (Ours)    0.67     19.39    0.277
  • LPIPS reduced by 14% on external cameras, relative to the reproduced Ctrl-World baseline
  • SSIM improved by 9.1% on the wrist camera, relative to the reproduced Ctrl-World baseline
  • Wins ~98% of paired comparisons ($p < 10^{-6}$)
  • 80% preference rate in a blind human study (n=200)
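The relative changes can be recomputed from the table as a quick sanity check. The sketch below uses the rounded table entries for the reproduced Ctrl-World baseline and PersistWorld; the helper name `relative_change` is hypothetical. Because the table is rounded to two or three digits, the recomputed percentages can differ from the reported ones by around a point, presumably because the reported figures use unrounded scores (an assumption).

```python
def relative_change(baseline: float, ours: float) -> float:
    """Signed percentage change of `ours` with respect to `baseline`."""
    return (ours - baseline) / baseline * 100.0


# Rounded values copied from the results table (Ctrl-World repro vs. PersistWorld).
lpips_external = relative_change(0.081, 0.070)  # negative means LPIPS went down
ssim_wrist = relative_change(0.62, 0.67)        # positive means SSIM went up
psnr_external = 24.42 - 23.02                   # PSNR is in dB, so report a difference

print(f"LPIPS (external): {lpips_external:+.1f}%")
print(f"SSIM (wrist):     {ssim_wrist:+.1f}%")
print(f"PSNR (external):  {psnr_external:+.2f} dB")
```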

Model Details

This checkpoint is a fine-tuned version of Ctrl-World (yjguo/Ctrl-World on the Hugging Face Hub), which is itself built on Stable Video Diffusion (SVD).

The weights file contains the merged UNet and action adapter weights. At inference time, the SVD pipeline is loaded from the Hugging Face Hub and this checkpoint is applied on top.

Training data: DROID robot manipulation dataset.

Usage

git clone https://github.com/Jai2500/PersistWorld
cd PersistWorld
pip install -r requirements.txt

# Download checkpoint
huggingface-cli download jaibrdhn/persistworld checkpoint-5760-merged.pt \
    --local-dir model_ckpt/

# Run rollout
PYTHONPATH=. python scripts/rollout_replay_traj.py \
    --dataset_root_path dataset_example \
    --dataset_meta_info_path dataset_meta_info \
    --dataset_names droid_subset \
    --ckpt_path model_ckpt/checkpoint-5760-merged.pt

See the GitHub repository for full installation and training instructions.

Citation

@inproceedings{bardhan2026persistworld,
  title     = {PersistWorld: Stabilizing Multi-step Robot World Model Rollouts
               via Reinforcement Learning},
  author    = {Bardhan, Jai and Drozd\'{i}k, Patrik and \v{S}ivic, Josef
               and Petr\'{i}k, Vladim\'{i}r},
  booktitle = {ArXiv Preprint},
  year      = {2026}
}