MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE
Abstract
MotionCrafter is a video diffusion framework that jointly reconstructs 4D geometry and estimates dense motion using a novel joint representation and 4D VAE architecture.
We introduce MotionCrafter, a video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense motion from a monocular video. The core of our method is a novel joint representation of dense 3D point maps and 3D scene flows in a shared coordinate system, together with a novel 4D VAE that learns this representation effectively. Prior work forces the 3D values and their latents to align strictly with RGB VAE latents, despite their fundamentally different distributions; we show that such alignment is unnecessary and leads to suboptimal performance. Instead, we introduce a new data normalization and VAE training strategy that better transfers diffusion priors and greatly improves reconstruction quality. Extensive experiments across multiple datasets demonstrate that MotionCrafter achieves state-of-the-art performance in both geometry reconstruction and dense scene flow estimation, delivering 38.64% and 25.0% improvements in geometry and motion reconstruction, respectively, all without any post-optimization. Project page: https://ruijiezhu94.github.io/MotionCrafter_Page
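To make the idea of a joint point-map and scene-flow representation concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the tensor layout or the normalization, so the 6-channel stacking, the function names (`make_joint_representation`, `normalize_joint`, `advect`), and the center-and-scale normalization are all assumptions chosen for illustration.

```python
import numpy as np


def make_joint_representation(points, flows):
    """Stack point maps and scene flows into one (T, H, W, 6) array.

    Assumed layout (hypothetical): both inputs are (T, H, W, 3) and are
    expressed in the same shared coordinate system.
    """
    if points.shape != flows.shape or points.shape[-1] != 3:
        raise ValueError("points and flows must both have shape (T, H, W, 3)")
    return np.concatenate([points, flows], axis=-1)


def normalize_joint(joint, eps=1e-8):
    """Illustrative normalization (an assumption, not the paper's scheme).

    Points are centered on their mean and divided by their mean distance
    to that center. Flows are displacements, so they are only rescaled,
    never shifted.
    """
    points, flows = joint[..., :3], joint[..., 3:]
    center = points.reshape(-1, 3).mean(axis=0)
    scale = np.linalg.norm(points - center, axis=-1).mean() + eps
    normalized = np.concatenate(
        [(points - center) / scale, flows / scale], axis=-1
    )
    return normalized, center, scale


def advect(joint):
    """Move every 3D point by its scene flow vector."""
    return joint[..., :3] + joint[..., 3:]
```

Because points and flows share one coordinate system and one scale in this sketch, moving points by their flow gives the same result before or after normalization (up to the same shift and scale). That consistency is the main reason to normalize the two quantities together rather than independently.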
Community
Excited to share our latest work MotionCrafter!
The first video diffusion-based framework for joint geometry and motion estimation.
Paper: http://arxiv.org/abs/2602.08961
Project page: https://ruijiezhu94.github.io/MotionCrafter_Page
Code: https://github.com/TencentARC/MotionCrafter
HF Models: https://huggingface.co/TencentARC/MotionCrafter
Both training and inference code are provided!
Feedback and discussions are very welcome!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction (2026)
- Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video (2026)
- Motion 3-to-4: 3D Motion Reconstruction for 4D Synthesis (2026)
- TrajVG: 3D Trajectory-Coupled Visual Geometry Learning (2026)
- LaVR: Scene Latent Conditioned Generative Video Trajectory Re-Rendering using Large 4D Reconstruction Models (2026)
- Joint Geometry-Appearance Human Reconstruction in a Unified Latent Space via Bridge Diffusion (2026)
- DePT3R: Joint Dense Point Tracking and 3D Reconstruction of Dynamic Scenes in a Single Forward Pass (2025)