GTPT: Group-based Token Pruning Transformer for Efficient Human Pose Estimation

Haonan Wang, Jie Liu, Jie Tang, Gangshan Wu, Bo Xu, Yanbing Chou, Yong Wang
2024-07-16
Abstract: In recent years, 2D human pose estimation has made significant progress on public benchmarks. However, many of these approaches face challenges of less applicability in the industrial community due to the large number of parametric quantities and computational overhead. Efficient human pose estimation remains a hurdle, especially for whole-body pose estimation with numerous keypoints. While most current methods for efficient human pose estimation primarily rely on CNNs, we propose the Group-based Token Pruning Transformer (GTPT) that fully harnesses the advantages of the Transformer. GTPT alleviates the computational burden by gradually introducing keypoints in a coarse-to-fine manner. It minimizes the computation overhead while ensuring high performance. Besides, GTPT groups keypoint tokens and prunes visual tokens to improve model performance while reducing redundancy. We propose the Multi-Head Group Attention (MHGA) between different groups to achieve global interaction with little computational overhead. We conducted experiments on COCO and COCO-WholeBody. Compared to other methods, the experimental results show that GTPT can achieve higher performance with less computation, especially in whole-body with numerous keypoints.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses efficient human pose estimation (HPE), particularly the computational cost of whole-body pose estimation, where the number of keypoints is large. Specifically:

- **Computational Efficiency Issue**: Many existing methods carry a heavy computational load and a large number of parameters, which makes them difficult to deploy in industrial applications.
- **Distinguishing Dense Keypoint Areas**: When the number of keypoints increases significantly, high-resolution feature maps struggle to distinguish dense keypoints, such as facial keypoints.
- **Capturing Long-Distance Correlations**: Convolutional Neural Network (CNN)-based methods struggle to capture long-distance correlations between keypoints.

To address these issues, the paper proposes the Group-based Token Pruning Transformer (GTPT), which leverages the advantages of the Transformer and reduces redundancy by gradually introducing keypoints and pruning visual tokens, improving computational efficiency while maintaining high performance.

The main contributions of GTPT include:

1. **Gradual Introduction of Keypoints**: Keypoints are introduced gradually from coarse to fine, alleviating the computational burden brought by a large number of keypoints.
2. **Group-based Pruning**: By grouping keypoint tokens and pruning visual tokens, redundancy is further reduced and model performance improves.
3. **Multi-Head Group Attention Mechanism**: A Multi-Head Group Attention (MHGA) mechanism operates between groups to capture relationships among all keypoints while keeping computational overhead low.

Experimental results on COCO and COCO-WholeBody show that, compared to existing methods, GTPT achieves higher performance with less computation, particularly in whole-body pose estimation.
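To make the group-based pruning idea more concrete, the following is a minimal sketch, not the paper's implementation. It assumes that each keypoint group keeps only the visual tokens that receive the most attention from that group's keypoint tokens. This top-k criterion, the function names `prune_visual_tokens` and `pairwise_attention_cost`, and the toy attention matrix are all illustrative assumptions; the summary above does not state the exact pruning rule GTPT uses.

```python
from typing import Dict, List, Sequence


def prune_visual_tokens(
    attn: Sequence[Sequence[float]],
    groups: Dict[str, List[int]],
    keep: int,
) -> Dict[str, List[int]]:
    """Return, per keypoint group, the indices of the visual tokens to keep.

    attn[k][v] is the attention weight from keypoint token k to visual token v.
    Assumption: a visual token's relevance to a group is the total attention
    it receives from the keypoint tokens in that group.
    """
    n_visual = len(attn[0])
    kept: Dict[str, List[int]] = {}
    for name, kp_ids in groups.items():
        # Score each visual token by the attention mass coming from this group.
        scores = [sum(attn[k][v] for k in kp_ids) for v in range(n_visual)]
        # Rank by descending score; ties are broken by the lower token index.
        ranked = sorted(range(n_visual), key=lambda v: (-scores[v], v))
        kept[name] = sorted(ranked[:keep])
    return kept


def pairwise_attention_cost(num_tokens: int) -> int:
    """Number of query-key pairs in full self-attention over num_tokens tokens."""
    return num_tokens * num_tokens


# Toy example: 4 keypoint tokens (2 "body", 2 "face") and 6 visual tokens.
attn = [
    [0.5, 0.3, 0.1, 0.1, 0.0, 0.0],
    [0.4, 0.4, 0.1, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.1, 0.6, 0.2],
    [0.0, 0.1, 0.0, 0.1, 0.3, 0.5],
]
groups = {"body": [0, 1], "face": [2, 3]}

kept = prune_visual_tokens(attn, groups, keep=2)
print(kept)  # {'body': [0, 1], 'face': [4, 5]}

# Query-key pairs for one group before and after pruning (2 keypoint tokens each).
print(pairwise_attention_cost(2 + 6))  # 64
print(pairwise_attention_cost(2 + 2))  # 16
```

In this toy setting each group attends over 4 tokens instead of 8, so the number of query-key pairs per group drops from 64 to 16. Because the groups no longer see each other's tokens after such a split, some cross-group mechanism is needed to recover global interaction, which is the role the summary attributes to MHGA.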