Abstract: We propose a Context-aware Feature Transformer Network (CaFTNet), a novel network for human pose estimation. To address the issue of limited modeling of global dependencies in convolutional neural networks, we design the Transformerneck to strengthen the expressive power of features. Transformerneck directly substitutes 3×3 convolution in the bottleneck of HRNet with a Contextual Transformer (CoT) block while reducing the complexity of the network. Specifically, the CoT first produces keys with static contextual information through 3×3 convolution. Then, relying on query and contextualization keys, dynamic contexts are generated through two concatenated 1×1 convolutions. Static and dynamic contexts are eventually fused as an output. Additionally, for multi-scale networks, in order to further refine the features of the fusion output, we propose an Attention Feature Aggregation Module (AFAM). Technically, given an intermediate input, the AFAM successively deduces attention maps along the channel and spatial dimensions. Then, an adaptive refinement module (ARM) is exploited to activate the obtained attention maps. Finally, the input undergoes adaptive feature refinement through multiplication with the activated attention maps. Through the above procedures, our lightweight network provides powerful clues for the detection of keypoints. Experiments are performed on the COCO and MPII datasets. The model achieves a 76.2 AP on the COCO val2017 dataset. Compared to other methods with a CNN as the backbone, CaFTNet has a 72.9% reduced number of parameters. On the MPII dataset, our method uses only 60.7% of the number of parameters, acquiring similar results to other methods with a CNN as the backbone.
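The AFAM data flow described in the abstract (channel attention, then spatial attention, activation, multiplication with the input) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the pooling used to obtain each attention map, the sigmoid used as a stand-in for the ARM activation, and the names `channel_logits`, `spatial_logits`, and `afam_sketch` are all assumptions made for illustration, and every learned layer is omitted.

```python
import math


def sigmoid(x):
    """Logistic function, used here as a stand-in for the ARM activation (assumption)."""
    return 1.0 / (1.0 + math.exp(-x))


def channel_logits(feat):
    """One value per channel: global average over H x W (learned layers omitted)."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]


def spatial_logits(feat):
    """One value per pixel: average over channels (learned layers omitted)."""
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    return [[sum(feat[k][i][j] for k in range(c)) / c for j in range(w)]
            for i in range(h)]


def afam_sketch(feat):
    """Refine a C x H x W feature map: channel attention, then spatial attention."""
    # Step 1: channel attention map, activated, multiplied into the input.
    ch_w = [sigmoid(v) for v in channel_logits(feat)]
    feat_c = [[[x * ch_w[k] for x in row] for row in ch]
              for k, ch in enumerate(feat)]
    # Step 2: spatial attention map computed on the channel-refined features.
    sp_w = [[sigmoid(v) for v in row] for row in spatial_logits(feat_c)]
    return [[[x * sp_w[i][j] for j, x in enumerate(row)]
             for i, row in enumerate(ch)] for ch in feat_c]


# Toy input: 2 channels on a 2 x 2 spatial grid.
feat = [[[0.0, 0.0], [0.0, 0.0]],
        [[2.0, 2.0], [2.0, 2.0]]]
refined = afam_sketch(feat)
```

The output keeps the input shape, and because each activated weight lies strictly between 0 and 1, the sketch can only rescale features, never amplify them. A real module would presumably learn the mappings that produce the attention maps rather than rely on plain averages.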

What problem does this paper attempt to address?

The paper addresses the limited ability of convolutional neural networks (CNNs) to model global dependencies in human pose estimation. Because their receptive fields are limited, traditional CNNs cannot capture long-range interactions, which degrades performance on occluded keypoints. To address this, the authors propose a lightweight Context-aware Feature Transformer Network (CaFTNet) that strengthens feature representation while reducing the number of model parameters.

### Specific description of the problem

1. **Insufficient global dependency modeling**: The receptive fields of traditional CNNs are limited, so they cannot effectively capture global dependencies in an image. The limitation is most evident for complex human poses.
2. **Difficulty in detecting occluded keypoints**: Human poses are diverse and complex, so some keypoints may be occluded, and traditional methods struggle to detect them accurately.
3. **Excessive model parameters**: Existing high-precision models usually have a large number of parameters and consume substantial computing resources, which hinders practical deployment.

### Solutions

To deal with the above problems, the authors propose the following innovations:

1. **Context-aware Feature Transformer Network (CaFTNet)**
   - **Transformerneck**: The 3×3 convolution in the HRNet bottleneck is replaced with a Contextual Transformer (CoT) block, which enhances the contextual representation ability of the features.
   - **Attention Feature Aggregation Module (AFAM)**: An attention mechanism further refines the fused output features, improving feature fusion in the multi-scale network.
2. **Lightweight design**
   - The number of model parameters is significantly reduced while high performance is maintained. On the COCO dataset, CaFTNet-H4 has 72.9% fewer parameters than other methods with a CNN backbone; on the MPII dataset, CaFTNet uses only 60.7% of their parameters.

### Experimental results

- **COCO dataset**: CaFTNet-H4 achieves 76.2 AP on the COCO val2017 dataset.
- **MPII dataset**: CaFTNet achieves results comparable to existing CNN-backbone methods with fewer parameters.

Through these improvements, CaFTNet delivers competitive accuracy in human pose estimation while reducing model complexity and computational cost, which makes it better suited to practical applications.

### Summary

The main contribution of the paper is a lightweight Context-aware Feature Transformer Network (CaFTNet) that addresses the insufficient global dependency modeling of traditional CNNs in human pose estimation, with its effectiveness verified experimentally on COCO and MPII.
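The abstract phrases its two parameter figures differently: a 72.9% *reduction* on COCO, but *using only* 60.7% of the parameters on MPII. The short calculation below puts both on the same footing. It is a worked-arithmetic sketch only; the helper name `complement_pct` is hypothetical, and no absolute parameter counts are assumed because the abstract gives none.

```python
def complement_pct(pct):
    """Convert between 'percent reduced' and 'percent remaining' (they sum to 100)."""
    return 100.0 - pct


# COCO: "72.9% reduced number of parameters" -> share of the baseline that remains.
coco_remaining = complement_pct(72.9)
# MPII: "uses only 60.7% of the number of parameters" -> implied reduction.
mpii_reduction = complement_pct(60.7)

print(f"COCO: about {coco_remaining:.1f}% of the baseline parameters remain")
print(f"MPII: about {mpii_reduction:.1f}% fewer parameters than the baseline")
```

Read this way, the COCO figure corresponds to roughly 27.1% of the baseline parameters, and the MPII figure to a reduction of roughly 39.3%, not 60.7%.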

A Lightweight Context-Aware Feature Transformer Network for Human Pose Estimation

Context-Guided Adaptive Network for Efficient Human Pose Estimation.

Adaptively Fusing Complete Multi-resolution Features for Human Pose Estimation.

X-HRNet: Towards Lightweight Human Pose Estimation with Spatially Unidimensional Self-Attention

Complementary Feature Pyramid Network for Human Pose Estimation

HEViTPose: High-Efficiency Vision Transformer for Human Pose Estimation

Lightweight high-resolution network based on adaptive cross-dimensional weighting for human pose estimation

Human Pose Estimation Based on Lightweight Multi-Scale Coordinate Attention

Bilateral Pose Transformer for Human Pose Estimation.

Shift Pose: A Lightweight Transformer-like Neural Network for Human Pose Estimation

Implicit Decouple Network for Efficient Pose Estimation

An improved lightweight high-resolution network based on multi-dimensional weighting for human pose estimation

MSRT: multi-scale representation transformer for regression-based human pose estimation

A lightweight attention-driven distillation model for human pose estimation

Human Pose Estimation Based on Efficient and Lightweight High-Resolution Network (EL-HRNet)

Combining detailed appearance and multi-scale representation: a structure-context complementary network for human pose estimation

Attention-Enhanced Lightweight Hourglass Network for Human Pose Estimation

EANet: Towards Lightweight Human Pose Estimation With Effective Aggregation Network

Gated Region-Refine Pose Transformer for Human Pose Estimation.

HRPoseFormer: High-Resolution Transformer for Human Pose Estimation Via Multi-Scale Token Aggregation

Simple and Lightweight Human Pose Estimation