Synergistic Global-space Camera and Human Reconstruction from Videos

Yizhou Zhao, Tuanfeng Y. Wang, Bhiksha Raj, Min Xu, Jimei Yang, Chun-Hao Paul Huang
2024-05-24
Abstract: Remarkable strides have been made in reconstructing static scenes or human bodies from monocular videos. Yet, the two problems have largely been approached independently, without much synergy. Most visual SLAM methods can only reconstruct camera trajectories and scene structures up to scale, while most HMR methods reconstruct human meshes in metric scale but fall short in reasoning with cameras and scenes. This work introduces Synergistic Camera and Human Reconstruction (SynCHMR) to marry the best of both worlds. Specifically, we design Human-aware Metric SLAM to reconstruct metric-scale camera poses and scene point clouds using camera-frame HMR as a strong prior, addressing depth, scale, and dynamic ambiguities. Conditioning on the dense scene recovered, we further learn a Scene-aware SMPL Denoiser to enhance world-frame HMR by incorporating spatio-temporal coherency and dynamic scene constraints. Together, they lead to consistent reconstructions of camera trajectories, human meshes, and dense scene point clouds in a common world frame. Project page:
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the problem of reconstructing camera trajectories, human meshes, and scene point clouds from monocular videos while keeping all three consistent in a single global coordinate system. Specifically, it focuses on the following points:

1. **Depth, Scale, and Dynamic Ambiguity**: Most visual SLAM methods recover camera trajectories and scene structures only up to scale. Most HMR methods reconstruct human meshes at metric scale but fall short in reasoning about the camera and the scene. As a result, depth, scale, and dynamic ambiguities remain unresolved.
2. **Independent Processing of Static Scenes and Human Reconstruction**: Current methods usually treat static scene reconstruction and human reconstruction as separate problems. Processing them independently fails to exploit the complementary information between the two, which hurts the quality and consistency of the overall reconstruction.
3. **Camera Motion Estimation in Dynamic Scenes**: In dynamic scenes, especially when the moving foreground dominates the image, traditional monocular SLAM methods have difficulty estimating camera motion accurately. These methods assume that tracked keypoints are static, and dynamic objects break that assumption.

To overcome these problems, the paper proposes a new framework named **Synergistic Camera and Human Reconstruction (SynCHMR)**. The framework combines the strengths of HMR and SLAM to consistently reconstruct camera trajectories, human meshes, and dense scene point clouds in a common world frame. It has two main components:

- **Human-aware Metric SLAM**: Uses camera-frame HMR as a strong prior to calibrate the estimated depth, addressing depth, scale, and dynamic ambiguities and yielding camera poses and scene point clouds at metric scale.
- **Scene-aware SMPL Denoiser**: Conditioned on the recovered dense scene, a learned denoiser enhances world-frame HMR and improves the quality of the reconstructed human meshes by incorporating spatio-temporal coherency and dynamic scene constraints.

With these two components, SynCHMR can achieve more accurate and consistent reconstructions in complex real-world scenes.
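
To make the idea of calibrating estimated depth with a camera-frame human mesh more concrete, here is a minimal sketch. It is not the paper's implementation. It assumes a relative (scale-ambiguous) depth map `rel_depth`, a metric depth map `human_depth` rendered from the camera-frame human mesh, and a boolean `human_mask` marking the pixels covered by the mesh. It then fits a scale and shift by ordinary least squares on the human pixels and applies them to the whole image. The function name, the inputs, and the choice of a scale-and-shift model are all illustrative assumptions.

```python
import numpy as np


def align_depth_to_human(rel_depth, human_depth, human_mask):
    """Fit s, t so that s * rel_depth + t matches human_depth on human pixels.

    Hypothetical helper: a plain least-squares scale-and-shift alignment,
    used here only to illustrate how a metric human prior can calibrate depth.
    """
    # Use only pixels covered by the human mesh where both depths are finite.
    valid = human_mask & np.isfinite(rel_depth) & np.isfinite(human_depth)
    if valid.sum() < 2:
        raise ValueError("need at least two valid human pixels")

    x = rel_depth[valid].astype(np.float64)
    y = human_depth[valid].astype(np.float64)

    # Solve min_{s,t} || s * x + t - y ||^2 as a linear least-squares problem.
    design = np.stack([x, np.ones_like(x)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(design, y, rcond=None)

    # Apply the fitted calibration to every pixel, including the background.
    return scale * rel_depth + shift, scale, shift


if __name__ == "__main__":
    # Synthetic example: the true metric depth is an affine map of rel_depth.
    rng = np.random.default_rng(0)
    rel = rng.uniform(0.1, 1.0, size=(4, 5))
    mask = np.zeros((4, 5), dtype=bool)
    mask[1:3, 1:4] = True  # pretend the person covers these pixels
    human = np.where(mask, 2.0 * rel + 0.5, np.nan)

    metric, s, t = align_depth_to_human(rel, human, mask)
    print("fitted scale and shift:", s, t)
```

Restricting the fit to human pixels reflects the role of the human as a metric reference: the mesh has a plausible real-world size, so depth values that agree with it inherit metric scale. A practical system would likely need a robust fit and temporal consistency across frames, which this sketch omits.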