Abstract: By leveraging temporal dependency in video sequences, multi-frame human pose estimation algorithms have demonstrated remarkable results in complicated situations, such as occlusion, motion blur, and video defocus. These algorithms are predominantly based on heatmaps, resulting in high computation and storage requirements per frame, which limits their flexibility and real-time application in video scenarios, particularly on edge devices. In this paper, we develop an efficient and effective video-based human pose regression method, which bypasses intermediate representations such as heatmaps and instead directly maps the input to the output joint coordinates. Despite the inherent spatial correlation among adjacent joints of the human pose, the temporal trajectory of each individual joint exhibits relative independence. In light of this, we propose a novel Decoupled Space-Time Aggregation network (DSTA) to separately capture the spatial contexts between adjacent joints and the temporal cues of each individual joint, thereby avoiding the conflation of spatiotemporal dimensions. Concretely, DSTA learns a dedicated feature token for each joint to facilitate the modeling of their spatiotemporal dependencies. With the proposed joint-wise local-awareness attention mechanism, our method is capable of efficiently and flexibly utilizing the spatial dependency of adjacent joints and the temporal dependency of each joint itself. Extensive experiments demonstrate the superiority of our method. Compared to previous regression-based single-frame human pose estimation methods, DSTA significantly enhances performance, achieving an 8.9 mAP improvement on PoseTrack2017. Furthermore, our approach either surpasses or is on par with the state-of-the-art heatmap-based multi-frame human pose estimation methods. Project page:

What problem does this paper attempt to address?

This paper addresses efficient and accurate human pose regression in video sequences. Existing heatmap-based methods perform well in human pose estimation, but their high computational and storage requirements limit their flexibility in real-time video applications, especially on edge devices. Existing regression-based methods are more computationally efficient, but they are mainly designed for static images and ignore the temporal dependencies between video frames, so their performance degrades significantly on video inputs. The paper therefore proposes a new video-based human pose regression method, Decoupled Space-Time Aggregation (DSTA), which aims to overcome these problems and improve multi-frame human pose estimation.

### Main Contributions

1. **The DSTA framework**: DSTA is a novel video-based human pose regression method that efficiently models the spatial and temporal dependencies of human joints in video sequences.
2. **First regression-based multi-frame human pose estimation**: Compared with heatmap-based methods, DSTA is more efficient in computation and storage, which makes it suitable for real-time video applications, especially on edge devices.
3. **Experimental verification**: Extensive experiments show that DSTA significantly outperforms previous regression-based methods (an 8.9 mAP improvement on PoseTrack2017 over single-frame regression) and matches or exceeds state-of-the-art heatmap-based multi-frame methods.

### Method Overview

The main modules of DSTA are:

- **Backbone**: Extracts global feature maps.
- **Joint-centric Feature Decoder (JFD)**: Extracts a feature embedding for each joint from the global feature map.
- **Space-Time Decoupling (STD)**: Models the spatial structural dependencies and the temporal dynamic dependencies of joints separately.

### Key Technologies

1. **Joint-centric Feature Decoder (JFD)**:
   - Extracts a feature embedding for each joint from the global feature map through convolutional layers or fully-connected layers.
   - The per-joint feature embeddings are denoted \(\{F_j^i(t')\}_{j=1}^n\), where \(F_j^i(t')\) is the feature embedding of the \(j\)-th joint at time \(t'\).
2. **Space-Time Decoupling (STD)**:
   - **Temporal Decoupling (TD)**: Captures the temporal dynamic dependencies of each joint through a local-aware attention mechanism.
     \[ \dot{F}_j^i(t)=\text{S-ATT}(\tilde{S}_j^i),\quad j=1,2,\ldots,n \]
     where \(\tilde{S}_j^i=\langle F_j^i(t-T),\ldots,F_j^i(t),\ldots,F_j^i(t+T)\rangle\) is the sequence of embeddings of joint \(j\) over the \(2T+1\) frames centered on \(t\).
   - **Spatial Decoupling (SD)**: Captures the spatial structural dependencies of joints within the current frame through a local-aware attention mechanism.
     \[ \ddot{F}_j^i(t)=\text{S-ATT}(\langle F_j^i(t)\rangle_{j\in G(k)}),\quad k=1,\ldots,K \]
     where \(G(k)\) is the index set of the \(k\)-th group of joints.
3. **Spatio-Temporal Aggregation**:
   - Fuses the temporal and spatial feature embeddings to produce the final spatio-temporal aggregated features.
     \[ f_j^i(t)=\dot{F}_j^i(t)\oplus\ddot{F}_j^i(t),\quad j=1,2,\ldots,n \]
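The temporal attention, spatial attention, and fusion steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation, and it rests on several assumptions: `self_attention` is a plain single-head scaled dot-product attention with no learned projections standing in for S-ATT, the \(\oplus\) operator is taken to be concatenation, the joint groups are assumed to partition the joints, and the names `decoupled_space_time_aggregation`, `F`, and `groups` are hypothetical.

```python
import numpy as np


def self_attention(tokens):
    """Single-head scaled dot-product self-attention over the rows of `tokens`.

    Assumption: no learned query/key/value projections; this is a stand-in
    for the S-ATT operator in the summary, not the paper's exact module.
    """
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ tokens


def decoupled_space_time_aggregation(F, groups):
    """Fuse per-joint temporal and spatial features for the centre frame.

    F:      array of shape (2T+1, n, d), joint tokens for frames t-T .. t+T.
    groups: list of joint-index lists, assumed to partition range(n).
    Returns an (n, 2d) array: [temporal | spatial] features per joint.
    """
    num_frames, n, d = F.shape
    t = num_frames // 2  # index of the current (centre) frame

    # Temporal branch: each joint attends only to its own tokens across frames.
    temporal = np.zeros((n, d))
    for j in range(n):
        temporal[j] = self_attention(F[:, j, :])[t]

    # Spatial branch: within the centre frame, joints attend only inside their group.
    spatial = np.zeros((n, d))
    for group in groups:
        idx = np.asarray(group)
        spatial[idx] = self_attention(F[t, idx, :])

    # Aggregation: concatenation is assumed for the fusion operator.
    return np.concatenate([temporal, spatial], axis=-1)


# Toy usage: 5 frames (T = 2), 6 joints in two groups, 8-dimensional tokens.
rng = np.random.default_rng(0)
F = rng.normal(size=(5, 6, 8))
groups = [[0, 1, 2], [3, 4, 5]]
fused = decoupled_space_time_aggregation(F, groups)
print(fused.shape)  # (6, 16)
```

Because the two branches never mix axes, perturbing joint 3 in a neighbouring frame leaves the fused feature of joint 0 untouched, which is the decoupling the summary describes. A regression head (for example, a linear layer mapping each fused feature to 2D coordinates) would follow in a full model; it is omitted here.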

Video-Based Human Pose Regression via Decoupled Space-Time Aggregation
