Abstract: High-performing, real-time pose detection and tracking will enable computers to develop a finer-grained and more natural understanding of human behavior. However, implementing real-time human pose estimation remains a challenge. On the one hand, tracking semantic keypoints in live video footage requires substantial computational resources and a large number of parameters, which limits the accuracy of pose estimation. On the other hand, some recently proposed transformer-based models deliver outstanding performance with far fewer parameters and FLOPs. However, the self-attention module in the transformer is not computationally friendly, which makes it difficult to apply these models to real-time tasks. To overcome these problems, we propose a transformer-like model named ShiftPose, which is a regression-based approach. ShiftPose does not contain any self-attention module. Instead, we replace the self-attention module with a parameter-free operation called the shift operator. Meanwhile, we adopt a bridge-branch connection, instead of a fully-branched connection such as that of HRNet, as our multi-resolution integration scheme. Specifically, the bottom half of our model adds the previous output to the output from the top half of the model at the corresponding resolution. Finally, the simple yet promising disentangled representation (SimDR) is used to make the training process more stable. On the MPII dataset, ShiftPose achieved 86.4 PCKh and 29.1 PCKh@0.1. On the COCO dataset, it achieved 72.2 mAP and 91.5 AP50 at 255 fps on a GPU, with 10.2M parameters and 1.6 GFLOPs. In addition, we tested our model on single-stage 3D human pose estimation and drew several useful and exploratory conclusions. These results show good performance, and this paper provides a new method for high-performance, real-time pose detection and tracking.
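
The abstract does not spell out how the shift operator works, so the following is only a minimal NumPy sketch of one common parameter-free formulation, assumed here rather than taken from the paper: a few channel groups are moved by one pixel in each of four spatial directions, and the remaining channels pass through unchanged. The function name `spatial_shift`, the `n_div` parameter, the zero fill at the borders, and the `(C, H, W)` layout are all illustrative assumptions.

```python
import numpy as np


def spatial_shift(x, n_div=12):
    """Parameter-free spatial shift on a (C, H, W) feature map.

    Hypothetical sketch: C // n_div channels are shifted by one pixel in
    each of four directions; all other channels are copied unchanged.
    Positions vacated by a shift are filled with zeros.
    """
    if x.ndim != 3:
        raise ValueError("expected an array of shape (C, H, W)")
    if n_div < 4:
        raise ValueError("n_div must be >= 4 so the four groups fit in C")

    g = x.shape[0] // n_div          # channels per shifted group
    out = np.zeros_like(x)

    # Group 0: shift right along the width axis.
    out[0 * g:1 * g, :, 1:] = x[0 * g:1 * g, :, :-1]
    # Group 1: shift left along the width axis.
    out[1 * g:2 * g, :, :-1] = x[1 * g:2 * g, :, 1:]
    # Group 2: shift down along the height axis.
    out[2 * g:3 * g, 1:, :] = x[2 * g:3 * g, :-1, :]
    # Group 3: shift up along the height axis.
    out[3 * g:4 * g, :-1, :] = x[3 * g:4 * g, 1:, :]
    # Remaining channels are left as they are.
    out[4 * g:] = x[4 * g:]
    return out


if __name__ == "__main__":
    features = np.random.rand(48, 64, 48).astype(np.float32)
    shifted = spatial_shift(features, n_div=12)
    print(shifted.shape)
```

Because the operation only copies values, it adds no learnable parameters and no multiply-accumulate operations. In shift-based designs, mixing across channels is usually left to the surrounding layers, which may or may not match the actual ShiftPose layout.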

InfPose: Real-Time Infrared Multi-Human Pose Estimation for Edge Devices Based on Encoder-Decoder CNN Architecture

Context-Guided Adaptive Network for Efficient Human Pose Estimation

Human Pose Estimation from Depth Images via Inference Embedded Multi-task Learning

Shift Pose: A Lightweight Transformer-like Neural Network for Human Pose Estimation

Pose-native Network Architecture Search for Multi-person Human Pose Estimation

HEViTPose: High-Efficiency Vision Transformer for Human Pose Estimation

RTMPose: Real-Time Multi-Person Pose Estimation based on MMPose

EfficientPose: Scalable single-person pose estimation

Efficient Human Pose Estimation via 3D Event Point Cloud

FastHand: Fast monocular hand pose estimation on embedded systems

DIR-BHRNet: A Lightweight Network for Real-time Vision-based Multi-person Pose Estimation on Smartphones

A Compact and Powerful Single-Stage Network for Multi-Person Pose Estimation

MVPose: Realtime Multi-Person Pose Estimation Using Motion Vector on Mobile Devices

Human Pose Estimation in Monocular Omnidirectional Top-View Images

Lite Pose: Efficient Architecture Design for 2D Human Pose Estimation

Multi-task neural network with physical constraint for real-time multi-person 3D pose estimation from monocular camera

MovePose: A High-performance Human Pose Estimation Algorithm on Mobile and Edge Devices

3D Human Pose Estimation with Single Image and Inertial Measurement Unit (IMU) Sequence

ProcNet: Deep Predictive Coding Model for Robust-to-occlusion Visual Segmentation and Pose Estimation

MobiPose: real-time multi-person pose estimation on mobile devices

Deep3DPose: Realtime Reconstruction of Arbitrarily Posed Human Bodies from Single RGB Images