Abstract: Remarkable strides have been made in reconstructing static scenes or human bodies from monocular videos. Yet, the two problems have largely been approached independently, without much synergy. Most visual SLAM methods can only reconstruct camera trajectories and scene structures up to scale, while most HMR methods reconstruct human meshes in metric scale but fall short in reasoning with cameras and scenes. This work introduces Synergistic Camera and Human Reconstruction (SynCHMR) to marry the best of both worlds. Specifically, we design Human-aware Metric SLAM to reconstruct metric-scale camera poses and scene point clouds using camera-frame HMR as a strong prior, addressing depth, scale, and dynamic ambiguities. Conditioning on the dense scene recovered, we further learn a Scene-aware SMPL Denoiser to enhance world-frame HMR by incorporating spatio-temporal coherency and dynamic scene constraints. Together, they lead to consistent reconstructions of camera trajectories, human meshes, and dense scene point clouds in a common world frame. Project page:

What problem does this paper attempt to address?

This paper addresses the joint reconstruction of camera trajectories, human meshes, and scene point clouds from monocular videos, with all results expressed consistently in a single global coordinate system. It focuses on three issues:

1. **Depth, Scale, and Dynamic Ambiguity**: Most visual SLAM methods recover camera trajectories and scene structures only up to scale. Most HMR methods reconstruct human meshes at metric scale but are weak at reasoning about the camera and the scene. Used alone, neither resolves the depth, scale, and dynamic ambiguities.
2. **Independent Treatment of Static Scenes and Human Reconstruction**: Current methods usually handle static-scene reconstruction and human reconstruction separately. This leaves the complementary information between the two unused, which hurts the quality and consistency of the overall reconstruction.
3. **Camera Motion Estimation in Dynamic Scenes**: When a moving foreground dominates the image, traditional monocular SLAM methods have difficulty estimating camera motion accurately. These methods assume that tracked keypoints are static, and dynamic objects break that assumption.

To overcome these problems, the paper proposes a framework named **Synergistic Camera and Human Reconstruction (SynCHMR)**. It combines the strengths of HMR and SLAM to reconstruct camera trajectories, human meshes, and dense scene point clouds consistently in a global coordinate system. It has two components:

- **Human-aware Metric SLAM**: Uses camera-frame HMR as a strong prior to calibrate the estimated depth, addressing the depth, scale, and dynamic ambiguities and yielding camera poses and scene point clouds at metric scale.
- **Scene-aware SMPL Denoiser**: Conditioned on the recovered dense scene, a learned denoiser enhances world-frame HMR by incorporating spatio-temporal coherency and dynamic scene constraints, improving the quality of the reconstructed human meshes.

With these components, SynCHMR can achieve more accurate and consistent reconstructions in complex real-world scenes.
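The idea of calibrating estimated depth against a metric human prior can be illustrated with a small sketch. It assumes (this is not taken from the paper) that calibration is a per-frame scale and shift fitted by ordinary least squares between relative depth values sampled at human pixels and the metric depths of the corresponding camera-frame HMR mesh points. The function names `fit_scale_shift` and `calibrate` are hypothetical, and the actual SynCHMR procedure may differ.

```python
from typing import List, Sequence, Tuple


def fit_scale_shift(rel: Sequence[float], metric: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of metric ~= s * rel + b.

    rel:    relative (up-to-scale) depths sampled at human pixels (assumed input)
    metric: metric depths of the HMR mesh at the same pixels (assumed input)
    """
    n = len(rel)
    if n < 2 or n != len(metric):
        raise ValueError("need at least two paired samples")
    mean_r = sum(rel) / n
    mean_m = sum(metric) / n
    var_r = sum((r - mean_r) ** 2 for r in rel)
    if var_r == 0.0:
        raise ValueError("relative depths are constant; scale is unobservable")
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(rel, metric))
    s = cov / var_r            # closed-form slope of the 1D regression
    b = mean_m - s * mean_r    # intercept that matches the two means
    return s, b


def calibrate(rel_depth_map: List[List[float]], s: float, b: float) -> List[List[float]]:
    """Apply the fitted scale and shift to every pixel of a depth map."""
    return [[s * d + b for d in row] for row in rel_depth_map]


if __name__ == "__main__":
    # Toy data: relative depths on the human and the mesh depths in meters.
    s, b = fit_scale_shift([1.0, 2.0, 3.0], [2.8, 5.3, 7.8])
    print(calibrate([[1.0, 4.0], [2.0, 3.0]], s, b))
```

In this sketch the human region supplies the only metric constraint, and the fitted transform is then applied to the whole depth map, background included. A practical system would likely need a robust fit, since human masks and mesh depths are noisy.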

Synergistic Global-space Camera and Human Reconstruction from Videos
