FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space
Abstract
FSVideo is a fast transformer-based image-to-video diffusion framework that combines a highly compressed video autoencoder, a diffusion transformer architecture with a layer memory design, and a multi-resolution generation strategy to achieve competitive quality with significantly reduced computation time.
We introduce FSVideo, a fast transformer-based image-to-video (I2V) diffusion framework. We build our framework on the following key components: 1) a new video autoencoder with a highly compressed latent space (64×64×4 spatial-temporal downsampling ratio) that achieves competitive reconstruction quality; 2) a diffusion transformer (DIT) architecture with a new layer memory design that enhances inter-layer information flow and context reuse within the DIT; and 3) a multi-resolution generation strategy that uses a few-step DIT upsampler to increase video fidelity. Our final model, which contains a 14B DIT base model and a 14B DIT upsampler, achieves competitive performance against other popular open-source models while being an order of magnitude faster. We discuss our model design and training strategies in this report.
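To give a feel for what a 64×64×4 downsampling ratio means for the diffusion transformer's workload, the sketch below counts latent positions for an example clip. It is only back-of-the-envelope arithmetic, not the FSVideo autoencoder: the clip size (80 frames at 1024×1024), the 8×8×4 comparison ratio, the ceiling-division rounding, and the names `LatentShape` and `latent_shape` are all assumptions made for illustration. The abstract does not specify how the first frame or latent channels are handled, so neither is modeled here.

```python
import math
from typing import NamedTuple


class LatentShape(NamedTuple):
    """Size of the latent grid after downsampling (hypothetical helper type)."""
    frames: int
    height: int
    width: int

    @property
    def positions(self) -> int:
        # Number of spatio-temporal latent positions the DIT would attend over.
        return self.frames * self.height * self.width


def latent_shape(frames: int, height: int, width: int,
                 spatial_ratio: int = 64, temporal_ratio: int = 4) -> LatentShape:
    """Divide each axis by its downsampling ratio, rounding up.

    Rounding up is an assumption; a real autoencoder may pad or crop differently.
    """
    return LatentShape(
        frames=math.ceil(frames / temporal_ratio),
        height=math.ceil(height / spatial_ratio),
        width=math.ceil(width / spatial_ratio),
    )


# Example clip (assumed dimensions, not taken from the paper).
clip = dict(frames=80, height=1024, width=1024)

# 64x64x4 ratio quoted in the abstract.
fsvideo = latent_shape(**clip)

# Hypothetical 8x8x4 ratio, used here only as a point of comparison.
baseline = latent_shape(**clip, spatial_ratio=8)

reduction = baseline.positions // fsvideo.positions

print(f"64x64x4 latent grid: {tuple(fsvideo)} -> {fsvideo.positions} positions")
print(f"8x8x4 latent grid:   {tuple(baseline)} -> {baseline.positions} positions")
print(f"Reduction in latent positions: {reduction}x")
```

Under these assumptions the 64×64×4 grid has 64 times fewer positions than the 8×8×4 grid. Because self-attention cost grows roughly quadratically with sequence length, a reduction of this kind is one plausible reason a highly compressed latent space can cut generation time substantially.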
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- S2DiT: Sandwich Diffusion Transformer for Mobile Streaming Video Generation (2026)
- OSDEnhancer: Taming Real-World Space-Time Video Super-Resolution with One-Step Diffusion (2026)
- Autoregressive Video Autoencoder with Decoupled Temporal and Spatial Context (2025)
- VideoAR: Autoregressive Video Generation via Next-Frame&Scale Prediction (2026)
- SnapGen++: Unleashing Diffusion Transformers for Efficient High-Fidelity Image Generation on Edge Devices (2026)
- Single-step Diffusion-based Video Coding with Semantic-Temporal Guidance (2025)
- EditYourself: Audio-Driven Generation and Manipulation of Talking Head Videos with Diffusion Transformers (2026)