Abstract: We observe that recent state-of-the-art results on single-image human pose estimation were achieved by multi-stage Convolutional Neural Networks (CNNs). Notwithstanding their superior performance on static images, applying these models to videos is not only computationally intensive but also suffers from performance degradation and flickering. Such suboptimal results are mainly attributed to the inability to impose sequential geometric consistency, to handle severe image quality degradation (e.g. motion blur and occlusion), and to capture the temporal correlation among video frames. In this paper, we propose a novel recurrent network to tackle these problems. We show that if the weight-sharing scheme is imposed on the multi-stage CNN, it can be rewritten as a Recurrent Neural Network (RNN). This property decouples the relationship among multiple network stages and makes invoking the network on videos significantly faster. It also enables the adoption of Long Short-Term Memory (LSTM) units between video frames. We find that such a memory-augmented RNN is very effective in imposing geometric consistency among frames. It also handles input quality degradation in videos well while successfully stabilizing the sequential outputs. The experiments show that our approach significantly outperforms current state-of-the-art methods on two large-scale video pose estimation benchmarks. We also explore the memory cells inside the LSTM and provide insights into why such a mechanism benefits prediction in video-based pose estimation.
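The weight-sharing observation can be illustrated with a toy sketch. It is not the authors' network: the function names (`stage`, `multi_stage`, `recurrent`), the scalar weights, and the `tanh` update are all assumptions made for illustration, and the LSTM memory units are omitted. The sketch only shows that when every stage uses the same weights, unrolling the stages on one image is the same computation as running a single recurrent cell, which can then be applied once per video frame.

```python
import math

def stage(weights, features, belief):
    """One hypothetical refinement stage: mix frame features with the previous belief."""
    w_x, w_h, b = weights
    return [math.tanh(w_x * x + w_h * h + b) for x, h in zip(features, belief)]

def multi_stage(stage_weights, features, init_belief):
    """Multi-stage model on a single image: each stage has its own weights."""
    belief = init_belief
    for weights in stage_weights:
        belief = stage(weights, features, belief)
    return belief

def recurrent(shared_weights, frame_features, init_belief):
    """Recurrent form for video: one shared stage applied once per frame."""
    belief = init_belief
    outputs = []
    for features in frame_features:
        belief = stage(shared_weights, features, belief)
        outputs.append(belief)
    return outputs

# With shared weights, three stages on one image equal three recurrent steps
# on a "video" that repeats that image three times.
shared = (0.5, 0.3, 0.1)
features = [0.2, -0.4, 1.0]
init = [0.0, 0.0, 0.0]
static_result = multi_stage([shared] * 3, features, init)
video_results = recurrent(shared, [features] * 3, init)
```

In the recurrent form each frame costs one stage evaluation instead of a full pass through every stage, which is the intuition behind the speed-up described in the abstract.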

A Multi-Person Pose Estimation with LSTM for Video Stream

LSTM Pose Machines

Multi-Person Pose Estimation Using Bounding Box Constraint and LSTM

Deep Dual Consecutive Network for Human Pose Estimation

Live Stream Temporally Embedded 3D Human Body Pose and Shape Estimation

3D Human pose estimation from video via multi-scale multi-level spatial temporal features

Multi-person pose estimation using atrous convolution

Modelling Human Body Pose for Action Recognition Using Deep Neural Networks

Multi-Person Pose Estimation with Enhanced Channel-wise and Spatial Information

A Deconvolutional Bottom-up Deep Network for Multi-Person Pose Estimation

Rethinking on Multi-Stage Networks for Human Pose Estimation

Joint Human Detection and Head Pose Estimation Via Multistream Networks for RGB-D Videos

Human Pose Estimation Based on Lightweight Multi-Scale Coordinate Attention

Multi-Person Articulated Tracking With Spatial and Temporal Embeddings

Shape and Pose Estimation for Closely Interacting Persons Using Multi-view Images

Multi-Scale Supervised Network for Human Pose Estimation

Multi-person 3D pose estimation from unlabelled data

3D Human Pose Estimation from Deep Multi-View 2D Pose

High Efficient LSTM-based Network for Human Interaction Understanding

DFSTrack: Dual-stream fusion Siamese network for human pose tracking in videos

Multi-Scale Structure-Aware Network for Human Pose Estimation