Abstract: High-performance, real-time pose detection and tracking will enable computers to develop a finer-grained and more natural understanding of human behavior. However, implementing real-time human pose estimation remains a challenge. On the one hand, tracking semantic keypoints in live video footage requires substantial computational resources and a large number of parameters, which limits the accuracy of pose estimation. On the other hand, some recently proposed transformer-based models achieve outstanding performance with far fewer parameters and FLOPs. However, the self-attention module in the transformer is not computationally friendly, which makes it difficult to apply these models to real-time tasks. To overcome these problems, we propose a transformer-like, regression-based model named ShiftPose. ShiftPose does not contain any self-attention module; instead, we replace self-attention with a parameter-free operation called the shift operator. Meanwhile, we adopt a bridge-branch connection, instead of a fully branched connection such as that of HRNet, as our multi-resolution integration scheme. Specifically, the bottom half of our model sums the previous output with the output from the top half of the model at the matching resolution. Finally, the simple yet promising disentangled representation (SimDR) is used in our study to make the training process more stable. On the MPII dataset, the model achieves 86.4 PCKH and 29.1 PCKH@0.1. On the COCO dataset, it achieves 72.2 mAP and 91.5 AP50 at 255 fps on GPU, with 10.2M parameters and 1.6 GFLOPs. In addition, we tested our model on single-stage 3D human pose estimation and drew several useful, exploratory conclusions. These results show good performance, and this paper provides a new method for high-performance, real-time pose detection and tracking.
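To illustrate what a parameter-free shift operator can look like, the following is a minimal sketch rather than the authors' implementation. The abstract does not specify the shift pattern, so this sketch assumes a scheme common in shift-based vision models: four channel groups are each moved by one pixel in a different direction, vacated positions are filled with zeros, and the remaining channels pass through unchanged. The names `shift_plane`, `shift_operator`, and `group_size` are hypothetical.

```python
from typing import List

# Feature map indexed as [channel][row][col]; a plain-list stand-in for a tensor.
Tensor3D = List[List[List[float]]]


def shift_plane(plane: List[List[float]], dy: int, dx: int) -> List[List[float]]:
    """Shift one H x W plane by (dy, dx) pixels, filling vacated cells with 0."""
    h, w = len(plane), len(plane[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            src_r, src_c = r - dy, c - dx
            if 0 <= src_r < h and 0 <= src_c < w:
                out[r][c] = plane[src_r][src_c]
    return out


def shift_operator(x: Tensor3D, group_size: int = 1) -> Tensor3D:
    """Mix spatial information without any learnable parameters.

    Assumed pattern (not taken from the paper): channels [0, g) move right,
    [g, 2g) move left, [2g, 3g) move down, [3g, 4g) move up, and all
    remaining channels are copied unchanged, where g = group_size.
    """
    directions = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # (dy, dx) per group
    out: Tensor3D = []
    for ch, plane in enumerate(x):
        group = ch // group_size
        if group < len(directions):
            dy, dx = directions[group]
            out.append(shift_plane(plane, dy, dx))
        else:
            out.append([row[:] for row in plane])
    return out


# Usage: five identical 2 x 2 channels, one channel per shift group.
features = [[[1.0, 2.0], [3.0, 4.0]] for _ in range(5)]
shifted = shift_operator(features, group_size=1)
```

Because the operation only moves existing values, it adds no parameters and no multiply-accumulate operations; in a transformer-like block, the channel mixing would still be done by the layers that follow it.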

Bilateral Pose Transformer for Human Pose Estimation

HEViTPose: High-Efficiency Vision Transformer for Human Pose Estimation

HRPVT: High-Resolution Pyramid Vision Transformer for medium and small-scale human pose estimation

ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation

ViTPose++: Vision Transformer for Generic Body Pose Estimation

HRPoseFormer: High-Resolution Transformer for Human Pose Estimation Via Multi-Scale Token Aggregation

GITPose: going shallow and deeper using vision transformers for human pose estimation

A Lightweight Context-Aware Feature Transformer Network for Human Pose Estimation

3D human pose estimation with multi-hypotheses gated transformer

3D Human Pose Estimation with Spatial and Temporal Transformers

Shift Pose: A Lightweight Transformer-like Neural Network for Human Pose Estimation

DPIT: Dual-Pipeline Integrated Transformer for Human Pose Estimation

VTP: Volumetric Transformer for Multi-view Multi-person 3D Pose Estimation

Poseur: Direct Human Pose Regression with Transformers

Gated Region-Refine Pose Transformer for Human Pose Estimation

Joint graph convolution networks and transformer for human pose estimation in sports technique analysis

GTPT: Group-based Token Pruning Transformer for Efficient Human Pose Estimation

TFPose: Direct Human Pose Estimation with Transformers

PPT: token-Pruned Pose Transformer for monocular and multi-view human pose estimation

DSTFormer: 3D Human Pose Estimation with a Dual-scale Spatial and Temporal Transformer Network

Towards Precise 3D Human Pose Estimation with Multi-Perspective Spatial-Temporal Relational Transformers