TransNet: Parallel encoder architecture for human pose estimation

Chenxi Wang, Zinan Xiong, Ying Li, Yan Luo, Yu Cao
DOI: https://doi.org/10.1016/j.smhl.2023.100395
2023-01-01
Smart Health
Abstract: Recently, self-attention mechanisms have become increasingly popular in computer vision applications, following the success of transformers in natural language processing. Yet transformers remain under-appreciated compared to convolutional neural networks, which still dominate the field of computer vision. In this study, we present various transformer approaches and their application to human pose estimation. We propose a novel model (TransNet) that combines a convolutional neural network design with a parallel transformer encoder branch, capturing long-range spatial dependencies while fusing them with the local features extracted from the input images. Experimental results show that TransNet achieves exceptional performance for human pose estimation on the COCO dataset. Our proposed model outperforms its competitors, achieving an Average Precision (AP) score of 78.3 on the COCO val set. In particular, the proposed model significantly improves on the average score of advanced convolutional neural networks. We believe this research can contribute to a better understanding of transformers within computer vision models.
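The parallel-branch idea described in the abstract can be illustrated with a toy sketch in plain Python. This is not the authors' implementation: the function names (`local_branch`, `global_branch`, `fuse`, `parallel_block`) are hypothetical, the feature map holds one scalar per position, the convolution is replaced by a fixed 3x3 mean filter, the attention uses identity projections, and fusion by element-wise addition is an assumption, since the abstract does not say how the two branches are combined.

```python
import math


def local_branch(feat):
    """Stand-in for a convolution: 3x3 mean filter with zero padding."""
    h, w = len(feat), len(feat[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        total += feat[y][x]
            out[i][j] = total / 9.0
    return out


def global_branch(feat):
    """Stand-in for a transformer encoder: single-head self-attention
    over all positions, with identity query/key/value projections."""
    w = len(feat[0])
    tokens = [v for row in feat for v in row]  # flatten H x W into a sequence
    attended = []
    for q in tokens:
        scores = [q * k for k in tokens]  # dot product; feature dim is 1
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        attended.append(sum(wt * v for wt, v in zip(weights, tokens)) / z)
    return [attended[r * w:(r + 1) * w] for r in range(len(feat))]


def fuse(local, global_):
    """Assumed fusion rule: element-wise addition of the two branches."""
    return [[a + b for a, b in zip(lr, gr)] for lr, gr in zip(local, global_)]


def parallel_block(feat):
    """Run both branches on the same input and fuse their outputs."""
    return fuse(local_branch(feat), global_branch(feat))


if __name__ == "__main__":
    feature_map = [[0.0, 1.0, 0.0],
                   [1.0, 2.0, 1.0],
                   [0.0, 1.0, 0.0]]
    for row in parallel_block(feature_map):
        print([round(v, 3) for v in row])
```

Each output position mixes a neighbourhood average (local context) with an attention-weighted average over every position (long-range context). A real model would use learned convolutions, multi-head attention with learned projections, and multi-channel features.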