Bilateral Pose Transformer for Human Pose Estimation.

Chia-Chen Yen, Tao Pin, Hongmin Xu
DOI: https://doi.org/10.1145/3532342.3532346
2022-01-01
Abstract: Human pose estimation is a well-defined, fundamental task that the computer vision community has studied for years. Previous Convolutional Neural Network (CNN) based works have achieved significant success in human pose estimation. Recently, the Vision Transformer (VT) has shown superior performance on computer vision tasks. However, current VT methods place less emphasis on local information and often rely on single-scale features, which may be unsuitable for dense image prediction tasks that inherently require multi-scale representations. In this paper, we propose a novel Bilateral Pose Transformer (BPT) framework for human pose estimation. Specifically, BPT consists of a novel bilateral branch encoder and a multi-scale integrating decoder. The bilateral branch encoder contains a Context Branch (CB) and a Spatial Branch (SB). The CB uses a VT-based backbone to capture context clues and produce multi-scale context features. The CNN-based SB maintains high-resolution representations rich in spatial information, explicitly supplying the local spatial detail that the CB lacks. In the decoder, a Mixed Feature Module consisting of a local attention CNN is proposed to effectively integrate the context and spatial features across scales. Experiments demonstrate that our approach achieves competitive performance in human pose estimation. Specifically, compared to HRNet [1], BPT saves 43% of the GFLOPs while dropping only 0.1 points of AP, achieving 75.7% AP with 9.0 GFLOPs on the COCO keypoints dataset.
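The closing comparison can be unpacked with a little arithmetic. The sketch below back-computes the HRNet baseline implied by the abstract's figures, assuming the 43% saving is measured relative to the baseline's GFLOPs and the 0.1-point drop relative to its AP. The abstract does not say which HRNet variant or input resolution is used, so the derived values are approximate implications of the quoted numbers, not reported results.

```python
# Figures quoted in the abstract for BPT on the COCO keypoints dataset.
bpt_gflops = 9.0
bpt_ap = 75.7
flops_saving = 0.43  # "saves 43% GFLOPs" relative to the HRNet baseline
ap_drop = 0.1        # "drops only 0.1 points AP" relative to the baseline

# Assumption: saving = 1 - bpt_gflops / baseline_gflops.
baseline_gflops = bpt_gflops / (1.0 - flops_saving)
# Assumption: the AP drop is an absolute difference in AP points.
baseline_ap = bpt_ap + ap_drop

print(f"implied baseline cost: {baseline_gflops:.1f} GFLOPs")
print(f"implied baseline accuracy: {baseline_ap:.1f} AP")
```

Under these assumptions the baseline comes out at roughly 15.8 GFLOPs and 75.8 AP, so the trade-off is about 6.8 GFLOPs saved for 0.1 AP points lost.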