Abstract: Monocular 3D human pose estimation has made progress in recent years. Most methods focus on a single person and estimate the pose in person-centric coordinates, i.e., coordinates relative to the center of the target person. Hence, these methods are inapplicable to multi-person 3D pose estimation, where absolute coordinates (e.g., camera coordinates) are required. Moreover, multi-person pose estimation is more challenging than single-person pose estimation, due to inter-person occlusion and close human interactions. Existing top-down methods rely on human detection, and thus suffer from detection errors and cannot produce reliable pose estimates in multi-person scenes. Meanwhile, existing bottom-up methods do not use human detection and are therefore unaffected by detection errors, but since they process all persons in a scene at once, they are prone to errors, particularly for persons at small scales. To address these challenges, we propose integrating the top-down and bottom-up approaches to exploit the strengths of both. Our top-down network estimates the joints of all persons in an image patch instead of only one, making it robust to erroneous bounding boxes. Our bottom-up network incorporates human-detection-based normalized heatmaps, making it more robust to scale variations. Finally, the 3D poses estimated by the top-down and bottom-up networks are fed into our integration network to produce the final 3D poses. To address the common gaps between training and testing data, we perform test-time optimization, refining the estimated 3D human poses using a high-order temporal constraint, a re-projection loss, and bone-length regularization. Our evaluations demonstrate the effectiveness of the proposed method.
Code and models are available at <a class="link-external link-https" href="https://github.com/3dpose/3D-Multi-Person-Pose" rel="external noopener nofollow">https://github.com/3dpose/3D-Multi-Person-Pose</a>.
