Abstract: In this paper, we study the problem of end-to-end multi-person pose estimation. State-of-the-art solutions adopt the DETR-like framework and focus mainly on designing complex decoders, e.g., ED-Pose regards pose estimation as keypoint box detection and combines it with human detection, and PETR predicts hierarchically with a pose decoder and a joint (keypoint) decoder. We present a simple yet effective transformer approach, named Group Pose. We simply regard $K$-keypoint pose estimation as predicting a set of $N\times K$ keypoint positions, each from a keypoint query, and represent each pose with an instance query for scoring the $N$ pose predictions. Motivated by the intuition that interaction among queries of different types across instances is not directly helpful, we make a simple modification to decoder self-attention. We replace the single self-attention over all $N\times(K+1)$ queries with two successive group self-attentions: (i) $N$ within-instance self-attentions, each over $K$ keypoint queries and one instance query, and (ii) $(K+1)$ same-type across-instance self-attentions, each over $N$ queries of the same type. The resulting decoder removes the interaction among across-instance queries of different types, easing the optimization and thus improving the performance. Experimental results on MS COCO and CrowdPose show that our approach, without human box supervision, is superior to previous methods with complex decoders, and is even slightly better than ED-Pose, which uses human box supervision. Paddle (https://github.com/Michel-liu/GroupPose-Paddle) and PyTorch (https://github.com/Michel-liu/GroupPose) code are available.
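
The two group self-attentions described in the abstract can be illustrated with boolean attention masks. The following is a minimal sketch and not the authors' implementation: it assumes the $N\times(K+1)$ queries are stored instance by instance, with the instance query last in each group, and it uses plain single-head dot-product attention without learned projections, residual connections, or normalization. The names `group_masks`, `masked_self_attention`, and `group_self_attention` are hypothetical.

```python
import math


def group_masks(num_instances, num_keypoints):
    """Build the two boolean masks over N*(K+1) queries.

    Assumed layout (not taken from the paper): query index
    q = n * (K + 1) + t, where n is the instance and t is the query type
    (t < K is a keypoint query, t == K is the instance query).
    """
    group = num_keypoints + 1
    total = num_instances * group
    # (i) within-instance: both queries belong to the same instance n.
    within = [[i // group == j // group for j in range(total)]
              for i in range(total)]
    # (ii) across-instance, same type: both queries share the same type t.
    across = [[i % group == j % group for j in range(total)]
              for i in range(total)]
    return within, across


def masked_self_attention(queries, allowed):
    """Dot-product self-attention restricted to the positions in `allowed`."""
    dim = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        idx = [j for j, ok in enumerate(allowed[i]) if ok]
        scores = [sum(a * b for a, b in zip(q, queries[j])) / math.sqrt(dim)
                  for j in idx]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        outputs.append([
            sum(w * queries[j][d] for w, j in zip(weights, idx)) / norm
            for d in range(dim)
        ])
    return outputs


def group_self_attention(queries, num_instances, num_keypoints):
    """Apply within-instance attention, then same-type across-instance attention."""
    within, across = group_masks(num_instances, num_keypoints)
    hidden = masked_self_attention(queries, within)
    return masked_self_attention(hidden, across)


if __name__ == "__main__":
    n, k = 3, 2  # toy sizes: 3 instances, 2 keypoints each
    toy_queries = [[float(i), 1.0] for i in range(n * (k + 1))]
    result = group_self_attention(toy_queries, n, k)
    print(len(result), len(result[0]))
```

Under this layout, a pair of queries that differ in both instance and type is blocked by both masks, so the two steps never attend directly across instances between different query types, which is the interaction the abstract says is removed.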

DCAPose: Improve One-Stage Multi-Person Pose Estimation with Dynamic Center Assignment

A Compact and Powerful Single-Stage Network for Multi-Person Pose Estimation

AdaptivePose: Human Parts As Adaptive Points

FCPose: Fully Convolutional Multi-Person Pose Estimation with Dynamic Instance-Aware Convolutions

DirectPose: Direct End-to-End Multi-Person Pose Estimation

PoseDet: Fast Multi-Person Pose Estimation Using Pose Embedding

Unsupervised Universal Hierarchical Multi-Person 3D Pose Estimation for Natural Scenes

InsPose: Instance-Aware Networks for Single-Stage Multi-Person Pose Estimation

Rethinking on Multi-Stage Networks for Human Pose Estimation

Center point to pose: Multiple views 3D human pose estimation for multi-person

Multi-person Pose Estimation Based on Graph Grouping Optimization

Densely Connected Attentional Pyramid Residual Network for Human Pose Estimation

DP-Pose: Multi-Person Pose Estimation in Video Sequence Through Dynamic Programming

Single-Stage Multi-Person Pose Machines

Learning Local-Global Contextual Adaptation for Multi-Person Pose Estimation

Single-view Multi-Human Pose Estimation by Attentive Cross-Dimension Matching

SD-Pose: facilitating space-decoupled human pose estimation via adaptive pose perception guidance

A Deconvolutional Bottom-up Deep Network for Multi-Person Pose Estimation

Multi-Person 3D Pose Estimation with Occlusion Reasoning

Cascaded Pyramid Network for Multi-Person Pose Estimation

Group Pose: A Simple Baseline for End-to-End Multi-person Pose Estimation