HRPoseFormer: High-Resolution Transformer for Human Pose Estimation Via Multi-Scale Token Aggregation

Xiao-Wei Yu, Geng-Sheng Chen
DOI: https://doi.org/10.1109/icsict55466.2022.9963229
2022-01-01
Abstract: Vision Transformers show promise for human pose estimation (HPE). However, because their tokens share similar receptive fields, existing Transformers still struggle to handle scale variance. In this paper, we propose HRPoseFormer, a High-Resolution Transformer-based network for HPE. First, we introduce a self-attention module into the parallel multi-resolution structure to capture multi-scale features more effectively at lower computational complexity. Second, we use detail-specific feed-forward layers to supply the Transformer blocks with finer local features. Third, we adopt the unbiased data processing (UDP) strategy to obtain more accurate transformations between the different coordinate systems and keypoint formats. Experiments show that HRPoseFormer surpasses existing state-of-the-art methods, reaching 77.0 AP on the COCO keypoint dataset.
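The coordinate-system issue that UDP targets can be illustrated with a small sketch. It is not the authors' implementation. It assumes the commonly described form of UDP, in which image size is measured in pixel intervals (width minus one) rather than pixel count, and it uses hypothetical function names and example sizes (a 256-pixel-wide input and a 64-pixel-wide heatmap). The sketch measures how far a keypoint drifts when it is mapped to heatmap coordinates through a horizontal flip and back, compared with mapping it directly.

```python
# Hedged sketch of the coordinate convention behind unbiased data processing (UDP).
# All names and sizes are illustrative assumptions, not taken from the paper.

def to_heatmap_biased(x, w_in, w_hm):
    """Scale an input x-coordinate by the ratio of pixel counts."""
    return x * w_hm / w_in


def to_heatmap_unbiased(x, w_in, w_hm):
    """Scale an input x-coordinate by the ratio of pixel intervals (size - 1)."""
    return x * (w_hm - 1) / (w_in - 1)


def flip(x, w):
    """Horizontally flip a pixel coordinate in an image of width w."""
    return (w - 1) - x


def flip_mismatch(to_heatmap, x, w_in, w_hm):
    """Difference between the flip-and-flip-back route and the direct mapping."""
    direct = to_heatmap(x, w_in, w_hm)
    via_flip = flip(to_heatmap(flip(x, w_in), w_in, w_hm), w_hm)
    return via_flip - direct


if __name__ == "__main__":
    x, w_in, w_hm = 100.0, 256, 64
    print("biased mismatch:  ", flip_mismatch(to_heatmap_biased, x, w_in, w_hm))
    print("unbiased mismatch:", flip_mismatch(to_heatmap_unbiased, x, w_in, w_hm))
```

Under these assumptions the pixel-count convention leaves a constant sub-pixel offset between the flipped and direct results, while the pixel-interval convention makes the two routes agree up to floating-point rounding, which is the kind of systematic error the abstract's third contribution is meant to remove.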