Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models
Abstract
Light Interaction accelerates interactive video world models through adaptive computation strategies and optimized attention mechanisms without requiring model retraining.
Interactive video world models generate video chunk by chunk in response to user-controlled camera movements, enabling applications such as real-time game simulation, virtual scene navigation, and embodied AI training. However, scaling to long interactive trajectories is prohibitively expensive due to growing context memory, quadratic attention complexity, and repeated denoising steps. We present Light Interaction, a training-free inference acceleration framework for interactive video world models. Our key insight is that interaction naturally enables trajectory-dependent adaptive computation: retrieved spatial memory can be discarded during novel exploration, temporal context can be adjusted according to local latent dynamics, and early-step model outputs can be reused when the camera revisits familiar regions. Based on this insight, Light Interaction combines adaptive context management, denoising cache acceleration, and hardware-software co-designed 3D block sparse attention with fused Triton kernels. Evaluated on HY-WorldPlay and Matrix-Game-3.0, Light Interaction achieves up to 2.59x speedup without model retraining while maintaining competitive visual quality.
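The trajectory-dependent decisions described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: `ChunkPlan`, `plan_chunk`, the distance-based revisit test, the thresholds, and the rule that maps latent change to a context window are all hypothetical, chosen only to show the shape of a per-chunk policy. In particular, the abstract does not say in which direction temporal context is adjusted; the sketch assumes, arbitrarily, that larger latent change keeps a longer window.

```python
import math
from dataclasses import dataclass


@dataclass
class ChunkPlan:
    """Hypothetical per-chunk compute plan (names are illustrative only)."""
    use_spatial_memory: bool   # attend to retrieved spatial memory?
    temporal_window: int       # number of past chunks kept as temporal context
    reuse_early_steps: bool    # reuse cached early denoising outputs?


def plan_chunk(camera_pos, visited, latent_motion,
               revisit_radius=1.0, min_window=2, max_window=8):
    """Pick a compute plan for the next chunk.

    camera_pos:    current camera position as a tuple of floats
    visited:       camera positions of previously generated chunks
    latent_motion: assumed scalar in [0, 1] summarizing local latent change
    All thresholds and defaults are assumptions, not values from the paper.
    """
    # Distance to the closest previously visited camera position.
    nearest = min((math.dist(camera_pos, p) for p in visited), default=math.inf)
    revisiting = nearest <= revisit_radius

    # Assumed rule: scale the temporal window linearly with latent change.
    frac = min(max(latent_motion, 0.0), 1.0)
    window = min_window + round(frac * (max_window - min_window))

    # Novel exploration: skip retrieved spatial memory and run every step.
    # Revisit: keep spatial memory and reuse cached early-step outputs.
    return ChunkPlan(use_spatial_memory=revisiting,
                     temporal_window=window,
                     reuse_early_steps=revisiting)
```

Calling `plan_chunk((10.0, 0.0, 0.0), [(0.0, 0.0, 0.0)], latent_motion=0.0)` models novel exploration far from any visited position, while a position within `revisit_radius` of a visited one models a revisit. A real system would presumably derive these signals from the model's memory retrieval and latent statistics rather than from raw camera distance.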
Community
Light Interaction is a training-free inference acceleration framework for interactive video world models that reduces computational costs via adaptive context management and hardware-aware sparse attention.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- OmniMem: Scalable and Adaptive Memory Retrieval for Long Video Generation (2026)
- FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity (2026)
- WorldKV: Efficient World Memory with World Retrieval and Compression (2026)
- Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation (2026)
- Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory (2026)
- X-Cache: Cross-Chunk Block Caching for Few-Step Autoregressive World Models Inference (2026)
- CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives (2026)
Get this paper in your agent:
hf papers read 2605.31158
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash