Abstract: We propose the Context-aware Feature Transformer Network (CaFTNet), a novel network for human pose estimation. To address the limited ability of convolutional neural networks to model global dependencies, we design the Transformerneck to strengthen the expressive power of features. The Transformerneck directly replaces the 3×3 convolution in the bottleneck of HRNet with a Contextual Transformer (CoT) block while reducing the complexity of the network. Specifically, the CoT block first produces keys carrying static contextual information through a 3×3 convolution. Then, from the queries and the contextualized keys, dynamic contexts are generated through two consecutive 1×1 convolutions. The static and dynamic contexts are finally fused to form the output. Additionally, to further refine the fused features in multi-scale networks, we propose an Attention Feature Aggregation Module (AFAM). Given an intermediate input, the AFAM successively infers attention maps along the channel and spatial dimensions. An adaptive refinement module (ARM) is then used to activate the obtained attention maps. Finally, the input undergoes adaptive feature refinement through multiplication with the activated attention maps. Through these procedures, our lightweight network provides strong cues for keypoint detection. Experiments are performed on the COCO and MPII datasets. The model achieves 76.2 AP on the COCO val2017 dataset. Compared to other methods with a CNN backbone, CaFTNet reduces the number of parameters by 72.9%. On the MPII dataset, our method uses only 60.7% of the parameters while achieving results similar to those of other methods with a CNN backbone.
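The channel-then-spatial refinement described for the AFAM can be illustrated with a minimal, dependency-free sketch. The abstract does not specify how the attention maps are computed or what the ARM does internally, so everything concrete here is an assumption: the function names are hypothetical, per-channel and per-position averaging stands in for the attention inference, a plain sigmoid stands in for the ARM activation, and there are no learned weights.

```python
import math

def sigmoid(x):
    """Logistic function; squashes a raw attention score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def channel_scores(feat):
    """One raw score per channel: the mean over all spatial positions (assumed pooling)."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feat]

def spatial_scores(feat):
    """One raw score per position: the mean over all channels (assumed pooling)."""
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    return [[sum(feat[k][i][j] for k in range(c)) / c for j in range(w)]
            for i in range(h)]

def afam_sketch(feat):
    """Refine a C x H x W feature map (nested lists) with channel, then spatial attention."""
    # Channel stage: activate the per-channel scores and rescale each channel.
    ch_w = [sigmoid(s) for s in channel_scores(feat)]
    refined = [[[v * ch_w[k] for v in row] for row in ch]
               for k, ch in enumerate(feat)]
    # Spatial stage: computed on the channel-refined features, one weight per position.
    sp_w = [[sigmoid(s) for s in row] for row in spatial_scores(refined)]
    return [[[v * sp_w[i][j] for j, v in enumerate(row)]
             for i, row in enumerate(ch)]
            for ch in refined]

# Toy input: 2 channels of size 2 x 2.
features = [
    [[1.0, 2.0], [3.0, 4.0]],   # channel 0: strong responses
    [[0.0, 0.0], [0.0, 0.0]],   # channel 1: no response
]
refined = afam_sketch(features)
```

Because both sets of weights lie strictly between 0 and 1, this sketch can only attenuate features; a trained module would learn which channels and positions to keep close to their original magnitude.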

MSRT: multi-scale representation transformer for regression-based human pose estimation

Adaptively Fusing Complete Multi-resolution Features for Human Pose Estimation

Poseur: Direct Human Pose Regression with Transformers

TFPose: Direct Human Pose Estimation with Transformers

HRPoseFormer: High-Resolution Transformer for Human Pose Estimation Via Multi-Scale Token Aggregation

Gated Region-Refine Pose Transformer for Human Pose Estimation

Rethinking on Multi-Stage Networks for Human Pose Estimation

Multi-Scale Supervised Network for Human Pose Estimation

MRSAPose: Multi-level Routing Sparse Attention for Multi-Person Pose Estimation

A Lightweight Context-Aware Feature Transformer Network for Human Pose Estimation

Towards Precise 3D Human Pose Estimation with Multi-Perspective Spatial-Temporal Relational Transformers

Bilateral Pose Transformer for Human Pose Estimation

3D Human Pose Estimation with Spatial and Temporal Transformers

MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation

HRPVT: High-Resolution Pyramid Vision Transformer for medium and small-scale human pose estimation

Multi-hypothesis Representation Learning for Transformer-Based 3D Human Pose Estimation

Multi-Scale Structure-Aware Network for Human Pose Estimation

Coarse-to-Fine Multi-Scene Pose Regression with Transformers

HEViTPose: High-Efficiency Vision Transformer for Human Pose Estimation

ST²PE: Spatial and Temporal Transformer for Pose Estimation