Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models Paper • 2606.03988 • Published 20 days ago • 124
MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation Paper • 2606.09056 • Published 15 days ago • 6
minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models Paper • 2605.30263 • Published 26 days ago • 59
CoVEBench: Can Video Editing Models Handle Complex Instructions? Paper • 2606.08415 • Published 16 days ago • 49
SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer Paper • 2605.15178 • Published May 14 • 91
AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation Paper • 2605.13724 • Published May 13 • 104
LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation Paper • 2605.18739 • Published May 18 • 115
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture Paper • 2605.12500 • Published May 12 • 194
Mode Seeking meets Mean Seeking for Fast Long Video Generation Paper • 2602.24289 • Published Feb 27 • 41
Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance Paper • 2603.02175 • Published Mar 2 • 24
Code2World: A GUI World Model via Renderable Code Generation Paper • 2602.09856 • Published Feb 10 • 201
Urban Socio-Semantic Segmentation with Vision-Language Reasoning Paper • 2601.10477 • Published Jan 15 • 155
Region-Constraint In-Context Generation for Instructional Video Editing Paper • 2512.17650 • Published Dec 19, 2025 • 53
Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation Paper • 2510.01284 • Published Sep 30, 2025 • 37
view article Article You could have designed state of the art positional encoding FL33TW00D-HF • Nov 25, 2024 • 487
JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization Paper • 2503.23377 • Published Mar 30, 2025 • 57