Coherent Human-Scene Reconstruction from Multi-Person Multi-View Video in a Single Pass
Abstract
CHROMM is a unified framework that jointly estimates cameras, scene point clouds, and human meshes from multi-person multi-view videos using integrated geometric and human priors with improved speed and robustness.
Recent advances in 3D foundation models have led to growing interest in reconstructing humans and their surrounding environments. However, most existing approaches focus on monocular inputs, and extending them to multi-view settings requires additional overhead modules or preprocessed data. To this end, we present CHROMM, a unified framework that jointly estimates cameras, scene point clouds, and human meshes from multi-person multi-view videos without relying on external modules or preprocessing. We integrate strong geometric and human priors from Pi3X and Multi-HMR into a single trainable neural network architecture, and introduce a scale adjustment module to resolve the scale discrepancy between humans and the scene. We also introduce a multi-view fusion strategy that aggregates per-view estimates into a single representation at test time. Finally, we propose a geometry-based multi-person association method, which is more robust than appearance-based approaches. Experiments on EMDB, RICH, EgoHumans, and EgoExo4D show that CHROMM achieves competitive performance in global human motion and multi-view pose estimation while running over 8x faster than prior optimization-based multi-view approaches. Project page: https://nstar1125.github.io/chromm.
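The abstract mentions a geometry-based multi-person association method but does not detail it. A minimal sketch of the general idea, matching detections across views by 3D root-joint proximity, might look as follows; the greedy matching strategy, the `max_dist` threshold, and the function name are illustrative assumptions, not the paper's actual algorithm.

```python
# Illustrative sketch (NOT the paper's method): associate people across two
# views by 3D root-joint distance in a shared world frame, rather than by
# appearance features. Greedy closest-pair matching with a distance cutoff.
import math

def associate_by_geometry(roots_a, roots_b, max_dist=0.5):
    """Greedily match detections between two views by 3D root proximity.

    roots_a, roots_b: lists of (x, y, z) root positions in a shared frame.
    max_dist: assumed cutoff in meters beyond which no match is made.
    Returns a list of (i, j) index pairs; unmatched detections are dropped.
    """
    # Enumerate all cross-view pairs, sorted by distance so the closest
    # candidate pairs are matched first.
    candidates = sorted(
        (math.dist(a, b), i, j)
        for i, a in enumerate(roots_a)
        for j, b in enumerate(roots_b)
    )
    pairs, used_a, used_b = [], set(), set()
    for d, i, j in candidates:
        if d > max_dist or i in used_a or j in used_b:
            continue
        pairs.append((i, j))
        used_a.add(i)
        used_b.add(j)
    return pairs
```

Because the matching uses metric 3D positions, it stays stable under viewpoint and clothing changes that break appearance-based re-identification, which is consistent with the robustness claim in the abstract.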
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- AHAP: Reconstructing Arbitrary Humans from Arbitrary Perspectives with Geometric Priors (2026)
- Wid3R: Wide Field-of-View 3D Reconstruction via Camera Model Conditioning (2026)
- Masked Modeling for Human Motion Recovery Under Occlusions (2026)
- Hand3R: Online 4D Hand-Scene Reconstruction in the Wild (2026)
- UniScale: Unified Scale-Aware 3D Reconstruction for Multi-View Understanding via Prior Injection for Robotic Perception (2026)
- DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation (2026)
- PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction (2026)