Abstract: Estimating human poses from video is the foundation of many visual intelligence systems. Various convolutional neural networks have been proposed, achieving state-of-the-art performance on different image datasets. However, most existing approaches are image-based and deliver unreliable estimates on videos because they fail to model temporal consistency across frames. A more recent line of work leverages temporal cues for multi-frame person pose estimation, yet still does so in an instance-unaware fashion, disregarding the specific traits of different instances (persons) and different joints. In this paper, we propose a novel approach that learns specific keypoint motion representations for each person, termed the Personalized Motion-Aware Network (PMAN). PMAN comprises three components: (i) an Instance-Sensitive Extractor that adaptively computes spatial features according to human physical characteristics; (ii) a Keypoint Motion Encoder that separately generates convolution kernels with fine-grained keypoint motion encoding; and (iii) a Motion Driven Decoder that parses multi-frame spatial features of the same person to provide precise human pose estimates. Extensive experiments on the PoseTrack2017 and PoseTrack2018 datasets demonstrate that our approach greatly improves the performance of multi-frame human pose estimation. Notably, our approach surpasses the previous state-of-the-art method by 1.7 mAP, achieving 82.9 mAP on the PoseTrack2017 dataset.
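
The idea of generating convolution kernels from keypoint motion and using them to decode multi-frame features can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes one 1x1 kernel per keypoint, a single linear projection as the kernel generator, and plain averaging to fuse frames. All names (`generate_kernel`, `apply_kernel`, `decode_keypoints`, `proj`) are hypothetical and chosen for illustration only.

```python
from typing import List, Tuple

Feature = List[List[List[float]]]  # C x H x W feature map for one person crop


def generate_kernel(motion: List[float], proj: List[List[float]]) -> List[float]:
    """Map one keypoint's motion vector (length D) to a 1x1 kernel (length C).

    `proj` is a hypothetical learned C x D projection matrix (assumption).
    """
    return [sum(w * m for w, m in zip(row, motion)) for row in proj]


def apply_kernel(feature: Feature, kernel: List[float]) -> List[List[float]]:
    """Apply a 1x1 kernel across channels, producing an H x W heatmap."""
    height, width = len(feature[0]), len(feature[0][0])
    return [
        [sum(kernel[c] * feature[c][y][x] for c in range(len(kernel)))
         for x in range(width)]
        for y in range(height)
    ]


def decode_keypoints(
    frames: List[Feature],
    motions: List[List[float]],
    proj: List[List[float]],
) -> List[Tuple[int, int]]:
    """Return one (row, col) heatmap peak per keypoint.

    Frames of the same person are fused by averaging (an assumption made
    here for brevity), then each keypoint's motion-generated kernel is applied.
    """
    channels = len(frames[0])
    height, width = len(frames[0][0]), len(frames[0][0][0])
    fused = [
        [[sum(f[c][y][x] for f in frames) / len(frames) for x in range(width)]
         for y in range(height)]
        for c in range(channels)
    ]
    peaks = []
    for motion in motions:
        heatmap = apply_kernel(fused, generate_kernel(motion, proj))
        cells = ((y, x) for y in range(height) for x in range(width))
        peaks.append(max(cells, key=lambda p: heatmap[p[0]][p[1]]))
    return peaks


if __name__ == "__main__":
    # Toy data: 2 frames, 2 channels, 2x2 spatial grid, 2 keypoints.
    frame_a = [[[0.0, 1.0], [0.0, 0.0]], [[0.0, 0.0], [1.0, 0.0]]]
    frame_b = [[[0.0, 3.0], [0.0, 0.0]], [[0.0, 0.0], [2.0, 0.0]]]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(decode_keypoints([frame_a, frame_b], [[1.0, 0.0], [0.0, 1.0]], identity))
```

Because each keypoint gets its own kernel, two joints of the same person can respond to different channels of the same fused feature map, which is the property the sketch is meant to show.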

Self-supervised Siamese keypoint inference network for human pose estimation and tracking

Multi-modal 3D Human Tracking for Robots in Complex Environment with Siamese Point-Video Transformer

3D Point-to-Keypoint Voting Network for 6D Pose Estimation

PointSiamRCNN: Target-aware Voxel-based Siamese Tracker for Point Clouds

Robust Human Tracking Via Key Face Information

Self-supervised Keypoint Correspondences for Multi-Person Pose Estimation and Tracking in Videos

DFSTrack: Dual-stream fusion Siamese network for human pose tracking in videos

Multi-person pose estimation using atrous convolution

Siamese Attentional Cascade Keypoints Network for Visual Object Tracking

Multi-Person Pose Tracking With Sparse Key-Point Flow Estimation and Hierarchical Graph Distance Minimization

Pose Estimation for Swimmers in Video Surveillance

Improving Multi-Person Pose Tracking with A Confidence Network

Multi-Scale Supervised Network for Human Pose Estimation

Deep Dual Consecutive Network for Human Pose Estimation

Parallel Self-Attention and Spatial-Attention Fusion for Human Pose Estimation and Running Movement Recognition

Personalized motion kernel learning for human pose estimation

Towards High Performance Human Keypoint Detection

Siamese Local and Global Networks for Robust Face Tracking

Detect-and-Track: Efficient Pose Estimation in Videos

Siamese Attentional Keypoint Network for High Performance Visual Tracking

STPoseNet: A real-time spatiotemporal network model for robust mouse pose estimation