Abstract: Monocular 3D human pose estimation has made progress in recent years. Most methods focus on a single person and estimate poses in person-centric coordinates, i.e., coordinates relative to the center of the target person. Hence, these methods are inapplicable to multi-person 3D pose estimation, where absolute coordinates (e.g., camera coordinates) are required. Moreover, multi-person pose estimation is more challenging than single-person pose estimation because of inter-person occlusion and close human interactions. Existing top-down methods rely on human detection and thus suffer from detection errors, so they cannot produce reliable pose estimates in multi-person scenes. Existing bottom-up methods do not use human detection and are therefore unaffected by detection errors, but because they process all persons in a scene at once, they are prone to errors, particularly for persons at small scales. To address these challenges, we propose integrating the top-down and bottom-up approaches to exploit the strengths of both. Our top-down network estimates the joints of all persons in an image patch, rather than only one, making it robust to erroneous bounding boxes. Our bottom-up network incorporates human-detection-based normalized heatmaps, making it more robust to scale variations. Finally, the 3D poses estimated by the top-down and bottom-up networks are fed into our integration network to produce the final 3D poses. To address the common gap between training and testing data, we perform test-time optimization, refining the estimated 3D human poses using a high-order temporal constraint, a re-projection loss, and bone-length regularization. Our evaluations demonstrate the effectiveness of the proposed method.
Code and models are available at: https://github.com/3dpose/3D-Multi-Person-Pose

Diffusion Based Coarse-to-Fine Network for 3D Human Pose and Shape Estimation from Monocular Video

D3PRefiner: A Diffusion-based Denoise Method for 3D Human Pose Refinement

Efficient Multi-person Hierarchical 3D Pose Estimation for Autonomous Driving

Diffusion-Based Pose Refinement and Multi-Hypothesis Generation for 3D Human Pose Estimation

DiffuPose: Monocular 3D Human Pose Estimation via Denoising Diffusion Probabilistic Model

DiffPose: Toward More Reliable 3D Pose Estimation

3D Human Pose Estimation using Spatio-Temporal Networks with Explicit Occlusion Training

Learning Dynamical Human-Joint Affinity for 3D Pose Estimation in Videos

Di^2Pose: Discrete Diffusion Model for Occluded 3D Human Pose Estimation

Dual networks based 3D Multi-Person Pose Estimation from Monocular Video

Region Deformer Networks for Unsupervised Depth Estimation from Unconstrained Monocular Videos

3D Human pose estimation from video via multi-scale multi-level spatial temporal features

Simplified-attention Enhanced Graph Convolutional Network for 3D human pose estimation

Disentangled Diffusion-Based 3D Human Pose Estimation with Hierarchical Spatial and Temporal Denoiser

Enhanced 3D Human Pose Estimation from Videos by Using Attention-Based Neural Network with Dilated Convolutions

3D-UGCN: A Unified Graph Convolutional Network for Robust 3D Human Pose Estimation from Monocular RGB Images

3D Human Pose, Shape and Texture from Low-Resolution Images and Videos

Deep Dual Consecutive Network for Human Pose Estimation

Exploring Latent Cross-Channel Embedding for Accurate 3D Human Pose Reconstruction in a Diffusion Framework

Graph U-Shaped Network with Mapping-Aware Local Enhancement for Single-Frame 3D Human Pose Estimation