Abstract: In recent years, 2D human pose estimation has made significant progress on public benchmarks. However, many of these approaches are hard to apply in industry because of their large parameter counts and computational overhead. Efficient human pose estimation remains a hurdle, especially for whole-body pose estimation with numerous keypoints. While most current methods for efficient human pose estimation rely primarily on CNNs, we propose the Group-based Token Pruning Transformer (GTPT), which fully harnesses the advantages of the Transformer. GTPT alleviates the computational burden by gradually introducing keypoints in a coarse-to-fine manner, minimizing computational overhead while ensuring high performance. In addition, GTPT groups keypoint tokens and prunes visual tokens to improve model performance while reducing redundancy. We propose Multi-Head Group Attention (MHGA) between different groups to achieve global interaction with little computational overhead. We conducted experiments on COCO and COCO-WholeBody. The experimental results show that, compared to other methods, GTPT achieves higher performance with less computation, especially in whole-body pose estimation with numerous keypoints.

What problem does this paper attempt to address?

The paper aims to address the problem of efficient human pose estimation (HPE), particularly the computational cost of handling a large number of keypoints in whole-body pose estimation. Specifically:

- **Computational Efficiency Issue**: Many existing methods have an excessive computational load and a large number of parameters, making practical deployment in industrial applications difficult.
- **Distinguishing Dense Keypoint Areas**: When the number of keypoints increases significantly, even high-resolution feature maps struggle to distinguish densely packed keypoints, such as facial keypoints.
- **Capturing Long-Distance Correlations**: Convolutional Neural Network (CNN)-based methods struggle to capture long-distance correlations between keypoints.

To address these issues, the paper proposes the Group-based Token Pruning Transformer (GTPT), which leverages the advantages of Transformers and reduces redundancy by gradually introducing keypoints and pruning visual tokens, thereby improving computational efficiency while maintaining high performance. The main contributions of GTPT include:

1. **Gradual Introduction of Keypoints**: Keypoints are introduced gradually from coarse to fine, alleviating the computational burden brought by a large number of keypoints.
2. **Group-based Pruning**: By grouping keypoint tokens and pruning visual tokens, redundancy is further reduced and model performance is improved.
3. **Multi-Head Group Attention Mechanism**: A Multi-Head Group Attention (MHGA) mechanism is proposed to capture relationships among all keypoints while maintaining low computational overhead.

Experimental results show that, compared to existing methods, GTPT improves both computational efficiency and performance, and is particularly strong in whole-body pose estimation.
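The idea of group-based pruning described above can be illustrated with a small sketch. The summary does not state how GTPT scores visual tokens, so this example assumes one plausible criterion: each group of keypoint tokens keeps the visual tokens that receive the most attention from that group. The function names, the `keep_ratio` parameter, and the toy attention values are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def prune_visual_tokens(attn: List[List[float]], keep_ratio: float) -> List[int]:
    """Return indices of the visual tokens to keep, in their original order.

    attn[k][v] is the attention weight from keypoint token k to visual token v.
    Assumption: a visual token's importance is the total attention it receives.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not attn or not attn[0]:
        return []
    num_visual = len(attn[0])
    # Score each visual token by summing attention over the keypoint tokens.
    scores = [sum(row[v] for row in attn) for v in range(num_visual)]
    # Keep at least one token; the product is rounded down.
    num_keep = max(1, int(keep_ratio * num_visual))
    # Rank by descending score; ties go to the lower index.
    ranked = sorted(range(num_visual), key=lambda v: (-scores[v], v))
    return sorted(ranked[:num_keep])


def prune_per_group(
    attn: List[List[float]],
    groups: Dict[str, List[int]],
    keep_ratio: float,
) -> Dict[str, List[int]]:
    """Prune visual tokens separately for each group of keypoint tokens."""
    kept = {}
    for name, keypoint_ids in groups.items():
        group_attn = [attn[k] for k in keypoint_ids]
        kept[name] = prune_visual_tokens(group_attn, keep_ratio)
    return kept


# Toy example: 2 keypoint tokens attending to 4 visual tokens.
toy_attn = [
    [0.1, 0.6, 0.1, 0.2],  # a hypothetical "body" keypoint token
    [0.4, 0.1, 0.4, 0.1],  # a hypothetical "face" keypoint token
]
toy_groups = {"body": [0], "face": [1]}
print(prune_per_group(toy_attn, toy_groups, keep_ratio=0.5))
# prints {'body': [1, 3], 'face': [0, 2]}
```

In this toy setting each group retains a different half of the visual tokens, which is the intuition behind pruning per group: a face group and a body group do not need the same image regions. Because groups then see different token subsets, some cross-group mechanism is needed to restore global interaction, which is the role the summary attributes to MHGA.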

GTPT: Group-based Token Pruning Transformer for Efficient Human Pose Estimation
