Adapting VACE for Real-Time Autoregressive Video Diffusion
This is the companion model card for the paper Adapting VACE for Real-Time Autoregressive Video Diffusion.
Overview
This work presents modifications to VACE that enable real-time autoregressive video generation. The original VACE system applies bidirectional attention across the full sequence, which is incompatible with streaming. The key change moves reference frames out of the diffusion latent space into a parallel conditioning pathway, preserving the fixed chunk sizes and KV caching that autoregressive generation requires.
The adaptation reuses existing pretrained weights without retraining. Testing at the 1.3B and 14B model scales shows that structural control adds 20-30% latency overhead with minimal additional memory.
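A minimal sketch of how such a pathway could be wired, in plain PyTorch. All names (`ControlBranch`, `MainBlock`, `generate_stream`) and the zero-initialized output projection are illustrative assumptions, not the actual VACE or Scope implementation; the point is only that control latents travel through a parallel branch and are added as residual hints, while the main branch denoises fixed-length chunks against a growing KV cache.

```python
import torch
import torch.nn as nn
from typing import Optional

# Hypothetical sketch of the adapted conditioning pathway described above.
# Module and function names do not correspond to the actual VACE or
# Daydream Scope APIs.

class ControlBranch(nn.Module):
    """Parallel pathway: turns control latents (depth, flow, reference frames,
    masks, ...) into per-chunk hint residuals for the main denoising branch."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj_in = nn.Linear(dim, dim)
        self.block = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        # Zero-initialized output projection so the pretrained main branch is
        # unaffected before any control signal is learned (an assumption here).
        self.proj_out = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj_out.weight)
        nn.init.zeros_(self.proj_out.bias)

    def forward(self, control_tokens: torch.Tensor) -> torch.Tensor:
        h = self.block(self.proj_in(control_tokens))
        return self.proj_out(h)  # same shape as the main-branch tokens


class MainBlock(nn.Module):
    """Main-branch block that attends over a growing KV cache of past chunks,
    keeping each forward pass at a fixed chunk length."""

    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, kv_cache: Optional[torch.Tensor]):
        # The current chunk attends to cached past chunks plus itself.
        kv = x if kv_cache is None else torch.cat([kv_cache, x], dim=1)
        attn_out, _ = self.attn(x, kv, kv, need_weights=False)
        x = x + attn_out
        x = x + self.ff(x)
        return x, kv  # updated cache


def generate_stream(num_chunks: int = 4, chunk_len: int = 16, dim: int = 64) -> torch.Tensor:
    main, control = MainBlock(dim), ControlBranch(dim)
    kv_cache, outputs = None, []
    for _ in range(num_chunks):
        noisy_chunk = torch.randn(1, chunk_len, dim)      # latent chunk being denoised
        control_chunk = torch.randn(1, chunk_len, dim)    # time-aligned control latents
        hint = control(control_chunk)                     # parallel conditioning pathway
        x, kv_cache = main(noisy_chunk + hint, kv_cache)  # hint injected as a residual
        outputs.append(x)
    return torch.cat(outputs, dim=1)


if __name__ == "__main__":
    print(generate_stream().shape)  # torch.Size([1, 64, 64])
```

Keeping the control tokens out of the main attention sequence means the per-chunk sequence length, and therefore the cache layout, never changes, which is what keeps the streaming loop compatible with fixed chunk sizes and KV caching.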
Real-Time Demo
Resolume Arena as a live input into Scope via Spout:
VACE Control Examples
These comparisons show the adapted VACE conditioning across different control modes (corresponding to figures in the paper):
| Control Mode | Video |
|---|---|
| Depth | |
| Scribble | |
| Optical Flow | |
| Image-to-Video | |
| Inpainting | |
| Outpainting | |
| Layout | |
Reference Implementation
The reference implementation is available in Daydream Scope, a tool for running real-time, interactive generative AI video pipelines.
Author
Ryan Fosdick
Citation
```bibtex
@article{fosdick2026adapting,
  title={Adapting VACE for Real-Time Autoregressive Video Diffusion},
  author={Fosdick, Ryan},
  journal={arXiv preprint arXiv:2602.14381},
  year={2026}
}
```