Abstract: We consider the problem of recovering a single person's 3D human mesh from in-the-wild crowded scenes. While much progress has been made in 3D human mesh estimation, existing methods struggle when the test input contains crowded scenes. The first reason for the failure is a domain gap between training and testing data. Motion capture datasets, which provide accurate 3D labels for training, lack crowd data, and this impedes a network from learning image features of a target person that are robust to crowded scenes. The second reason is feature processing that spatially averages the feature map of a localized bounding box containing multiple people. Averaging the whole feature map makes the target person's features indistinguishable from those of others. We present 3DCrowdNet, the first method to explicitly target in-the-wild crowded scenes, which estimates a robust 3D human mesh by addressing the above issues. First, we leverage 2D human pose estimation, which does not require a motion capture dataset with 3D labels for training and does not suffer from the domain gap. Second, we propose a joint-based regressor that distinguishes the target person's features from those of others. Our joint-based regressor preserves the spatial activation of the target by sampling features from the target's joint locations, and then regresses human model parameters. As a result, 3DCrowdNet learns target-focused features and effectively excludes the irrelevant features of nearby persons. We conduct experiments on various benchmarks and demonstrate the robustness of 3DCrowdNet to in-the-wild crowded scenes both quantitatively and qualitatively. The code is available at https://github.com/hongsukchoi/3DCrowdNet_RELEASE.
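
The contrast between spatial averaging and joint-based feature sampling can be illustrated with a toy example. The sketch below is not the 3DCrowdNet implementation; it is a minimal NumPy illustration under stated assumptions: a hand-made two-channel feature map in which channel 0 responds to the target person (left half) and channel 1 to a nearby person (right half), and hypothetical function names `global_average_pool` and `sample_joint_features`. The joint coordinates are assumed to be given in feature-map pixel units, as a 2D pose estimator might supply them after rescaling.

```python
import numpy as np


def global_average_pool(feature_map):
    """Spatially average a (C, H, W) feature map into a (C,) vector."""
    return feature_map.mean(axis=(1, 2))


def sample_joint_features(feature_map, joints_xy):
    """Bilinearly sample a (C, H, W) feature map at J (x, y) locations.

    joints_xy has shape (J, 2) in feature-map pixel coordinates.
    Returns an array of shape (J, C), one feature vector per joint.
    """
    _, height, width = feature_map.shape
    # Clamp coordinates so that sampling never leaves the feature map.
    x = np.clip(joints_xy[:, 0], 0, width - 1)
    y = np.clip(joints_xy[:, 1], 0, height - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, width - 1)
    y1 = np.minimum(y0 + 1, height - 1)
    wx = x - x0  # horizontal interpolation weight
    wy = y - y0  # vertical interpolation weight
    top = feature_map[:, y0, x0] * (1 - wx) + feature_map[:, y0, x1] * wx
    bottom = feature_map[:, y1, x0] * (1 - wx) + feature_map[:, y1, x1] * wx
    return (top * (1 - wy) + bottom * wy).T


# Toy feature map (assumption): channel 0 fires on the target person in the
# left half, channel 1 fires on another person in the right half.
fmap = np.zeros((2, 8, 8))
fmap[0, :, :4] = 1.0
fmap[1, :, 4:] = 1.0

# Hypothetical 2D joint locations of the target person (x, y).
target_joints = np.array([[1.0, 2.0], [2.5, 5.0]])

pooled = global_average_pool(fmap)
sampled = sample_joint_features(fmap, target_joints)
print("averaged feature:", pooled)
print("per-joint features:\n", sampled)
```

In this toy setting the averaged vector weights both people equally, while the per-joint features contain only the target's channel. A real regressor would feed such per-joint features into further layers to predict human model parameters; that part is omitted here.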

CrowdRec: 3D Crowd Reconstruction from Single Color Images

Crowd3D: Towards Hundreds of People Reconstruction from a Single Image

Crowd3D++: Robust Monocular Crowd Reconstruction with Upright Space

Learning to Estimate Robust 3D Human Mesh from In-the-Wild Crowded Scenes

Learning Monocular Regression of 3D People in Crowds via Scene-aware Blending and De-occlusion

Exploring Severe Occlusion: Multi-Person 3D Pose Estimation with Gated Convolution

Online Global Non-rigid Registration for 3D Object Reconstruction Using Consumer-level Depth Cameras

CrowdHuman: A Benchmark for Detecting Human in a Crowd

Redesigning Multi-Scale Neural Network for Crowd Counting

HumanRecon: Neural Reconstruction of Dynamic Human Using Geometric Cues and Physical Priors

Reliably Detecting Humans in Crowded and Dynamic Environments Using RGB-D Camera

Learning Multi-Level Density Maps for Crowd Counting

Super-Resolution Information Enhancement For Crowd Counting

3D Crowd Counting via Geometric Attention-guided Multi-View Fusion

Single-Image Crowd Counting Via Multi-Column Convolutional Neural Network

Human Mesh Reconstruction with Generative Adversarial Networks from Single RGB Images

UNeR3D: Versatile and Scalable 3D RGB Point Cloud Generation from 2D Images in Unsupervised Reconstruction

MUG: Multi-human Graph Network for 3D Mesh Reconstruction from 2D Pose

3D real-time human reconstruction with a single RGBD camera

Rethinking pose estimation in crowds: overcoming the detection information-bottleneck and ambiguity

Rise Of The Indoor Crowd: Reconstruction Of Building Interior View Via Mobile Crowdsourcing