Abstract: Multi-person pose estimation generally follows either a top-down or a bottom-up paradigm. The top-down paradigm detects all human bounding boxes and then performs single-person pose estimation on each ROI. The bottom-up paradigm locates identity-free keypoints and then groups them into individuals. Both rely on an extra stage to build the relationship between a human instance and its keypoints (human detection in the top-down case, a grouping process in the bottom-up case). This extra stage leads to high computational cost and a redundant two-stage pipeline. To address this issue, we introduce a fine-grained body representation. Concretely, the human body is divided into several local parts, and each part is represented by an adaptive point. This representation sufficiently encodes diverse pose information and effectively models the relationship between a human instance and its keypoints in a single forward pass. Building on the proposed body representation, we further introduce a compact single-stage multi-person pose regression network, called AdaptivePose++, which is an extended version of the AAAI-22 paper AdaptivePose. During inference, the network needs only a single-step decoding operation to estimate multi-person poses, without complex post-processing or refinement. Without any bells and whistles, we achieve highly competitive performance in both accuracy and speed on the representative 2D pose estimation benchmarks MS COCO and CrowdPose. In particular, AdaptivePose++ outperforms the state-of-the-art SWAHR-W48 and CenterGroup-W48 by 3.2 AP and 1.4 AP, respectively, on COCO mini-val, with faster inference speed. Furthermore, its strong performance on the 3D pose estimation datasets MuCo-3DHP and MuPoTS-3D further demonstrates its effectiveness and generalizability to 3D scenes.
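
The single-step decoding described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the network's output heads, so everything below is an assumption made for illustration: a person-center heatmap, a map of offsets from each center to its adaptive part points, a map of offsets from each adaptive point to the keypoints of that part, and the hypothetical names `decode_poses`, `part_offsets`, `joint_offsets`, and `part_of_joint`. It is not the authors' implementation.

```python
import numpy as np

def decode_poses(center_heatmap, part_offsets, joint_offsets,
                 part_of_joint, score_thresh=0.5):
    """Decode poses in one pass, with no detection or grouping stage.

    All inputs are hypothetical network outputs (assumed, not from the paper):
      center_heatmap: (H, W) person-center scores
      part_offsets:   (P, 2, H, W) (dy, dx) from a center to each adaptive point
      joint_offsets:  (K, 2, H, W) (dy, dx) from an adaptive point to joint k
      part_of_joint:  length-K sequence mapping each joint to its part index
    Returns a list of (score, joints) with joints of shape (K, 2) as (y, x).
    """
    H, W = center_heatmap.shape
    K = len(part_of_joint)

    # A pixel is a center if it is the maximum of its 3x3 neighbourhood
    # and its score exceeds the threshold.
    padded = np.pad(center_heatmap, 1, mode="constant",
                    constant_values=-np.inf)
    neigh = np.stack([padded[dy:dy + H, dx:dx + W]
                      for dy in range(3) for dx in range(3)])
    is_peak = (center_heatmap >= neigh.max(axis=0)) & (center_heatmap > score_thresh)

    poses = []
    for cy, cx in zip(*np.nonzero(is_peak)):
        # Step 1: center -> adaptive point of every body part.
        parts = np.array([cy, cx], dtype=float) + part_offsets[:, :, cy, cx]

        # Step 2: adaptive point -> keypoints belonging to that part.
        joints = np.empty((K, 2), dtype=float)
        for k, p in enumerate(part_of_joint):
            # Read the joint offset at the (rounded, clipped) adaptive point.
            py, px = np.clip(np.rint(parts[p]).astype(int), 0, [H - 1, W - 1])
            joints[k] = parts[p] + joint_offsets[k, :, py, px]

        poses.append((float(center_heatmap[cy, cx]), joints))
    return poses
```

Because every keypoint is reached from its own person center through that person's adaptive points, instance membership comes for free, which is the property the abstract attributes to the body representation. Reading joint offsets from a dense map at the adaptive point is a simplification chosen here; a network could instead regress them from features sampled at that point.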

A Deep Structure for Human Pose Estimation

Human Pose Estimation Using Deep Structure Guided Learning

Multi-Scale Structure-Aware Network for Human Pose Estimation

Thin-Slicing Network: A Deep Structured Model for Pose Estimation in Videos

LatentHuman: Shape-and-Pose Disentangled Latent Representation for Human Bodies

Human Pose Estimation Based on Parallel Atrous Convolution and Body Structure Constraints

Adversarial PoseNet: A Structure-aware Convolutional Network for Human Pose Estimation

Monocular 3D Human Pose Estimation by Predicting Depth on Joints

Not All Parts Are Created Equal: 3D Pose Estimation by Modelling Bi-directional Dependencies of Body Parts

Latent Variable Pictorial Structure for Human Pose Estimation on Depth Images

Not All Parts Are Created Equal: 3D Pose Estimation by Modeling Bi-Directional Dependencies of Body Parts

Human Pose Estimation from Depth Images via Inference Embedded Multi-task Learning

Human Pose as Compositional Tokens

Deep Dual Consecutive Network for Human Pose Estimation

Detailed Human Shape Estimation From A Single Image By Hierarchical Mesh Deformation

Combining detailed appearance and multi-scale representation: a structure-context complementary network for human pose estimation

Semi-Dynamic Hypergraph Neural Network for 3D Pose Estimation

Body Structure Constraint for 3D Human Pose Estimation

AnatPose: Bidirectionally learning anatomy-aware heatmaps for human pose estimation

A Compact and Powerful Single-Stage Network for Multi-Person Pose Estimation

Double-chain Constraints for 3D Human Pose Estimation in Images and Videos