Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator
Abstract
Uni-ViGU presents a generation-centric approach to unified multimodal video understanding and generation, building on a diffusion-based video generator as the foundation through unified flow matching and a bidirectional training mechanism.
Unified multimodal models integrating visual understanding and generation face a fundamental challenge: visual generation incurs substantially higher computational costs than understanding, particularly for video. This imbalance motivates us to invert the conventional paradigm: rather than extending understanding-centric MLLMs to support generation, we propose Uni-ViGU, a framework that unifies video generation and understanding by extending a video generator as the foundation. We introduce a unified flow method that performs continuous flow matching for video and discrete flow matching for text within a single process, enabling coherent multimodal generation. We further propose a modality-driven MoE-based framework that augments Transformer blocks with lightweight layers for text generation while preserving generative priors. To repurpose generation knowledge for understanding, we design a bidirectional training mechanism with two stages: Knowledge Recall reconstructs input prompts to leverage learned text-video correspondences, while Capability Refinement fine-tunes on detailed captions to establish discriminative shared representations. Experiments demonstrate that Uni-ViGU achieves competitive performance on both video generation and understanding, validating generation-centric architectures as a scalable path toward unified multimodal intelligence. Project Page and Code: https://fr0zencrane.github.io/uni-vigu-page/.
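The abstract's unified flow method performs continuous flow matching for video and discrete flow matching for text within a single process. The sketch below is a minimal, hypothetical PyTorch illustration of how such a joint objective could be assembled; the `model` signature, the rectified-flow-style linear interpolation path, the `mask_id` token, and the timestep-tied masking schedule are all assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def unified_flow_matching_loss(model, video_latents, text_tokens, mask_id, t=None):
    """Hypothetical sketch: continuous flow matching on video latents plus a
    discrete (masked-token) flow-matching-style loss on text, computed in one
    forward pass. Names and signatures are illustrative, not from the paper."""
    B = video_latents.shape[0]
    if t is None:
        t = torch.rand(B, device=video_latents.device)  # shared timestep per sample

    # --- Continuous flow matching for video latents ---
    noise = torch.randn_like(video_latents)
    t_v = t.view(B, *([1] * (video_latents.dim() - 1)))
    x_t = (1.0 - t_v) * noise + t_v * video_latents      # linear path from noise to data
    target_velocity = video_latents - noise               # rectified-flow-style target

    # --- Discrete flow matching for text (masking tied to the same timestep) ---
    keep_prob = t.unsqueeze(-1).expand(B, text_tokens.shape[1])
    masked = torch.rand_like(keep_prob) > keep_prob        # mask more tokens near t = 0
    noisy_text = torch.where(masked, torch.full_like(text_tokens, mask_id), text_tokens)

    # Assumed interface: one forward pass yields predictions for both modalities.
    pred_velocity, text_logits = model(x_t, noisy_text, t)

    video_loss = F.mse_loss(pred_velocity, target_velocity)
    if masked.any():
        text_loss = F.cross_entropy(text_logits[masked], text_tokens[masked])
    else:
        text_loss = text_logits.sum() * 0.0                # no masked positions this batch
    return video_loss + text_loss
```

In this sketch a single timestep couples the two corruption processes, so the model sees consistently noised video and text at each training step; whether Uni-ViGU shares the timestep this way is not stated in the abstract and is an assumption of the example.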
Community
Here's an interesting observation that motivated this work: video generation is computationally much more expensive than video understanding. So why do most unified multimodal models start with understanding-focused architectures (like MLLMs) and then try to bolt on generation capabilities? The authors argue we should flip this around — start with a powerful video generator and extend it to handle understanding tasks. It's a bit like saying "if you can write a novel, you probably understand language pretty well already." Project page: https://fr0zencrane.github.io/uni-vigu-page/
Interesting breakdown of this paper on arXivLens: https://arxivlens.com/PaperView/Details/uni-vigu-towards-unified-video-generation-and-understanding-via-a-diffusion-based-video-generator-5526-33b8d679
Covers the executive summary, detailed methodology, and practical applications.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- UniMotion: A Unified Framework for Motion-Text-Vision Understanding and Generation (2026)
- UniCom: Unified Multimodal Modeling via Compressed Continuous Semantic Representations (2026)
- Diffusion Models for Joint Audio-Video Generation (2026)
- SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model (2026)
- Video2LoRA: Unified Semantic-Controlled Video Generation via Per-Reference-Video LoRA (2026)
- Dynin-Omni: Omnimodal Unified Large Diffusion Language Model (2026)
- MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation (2026)