Video-Based Human Pose Regression via Decoupled Space-Time Aggregation

Jijie He, Wenwu Yang
2024-04-01
Abstract: By leveraging temporal dependency in video sequences, multi-frame human pose estimation algorithms have demonstrated remarkable results in complicated situations, such as occlusion, motion blur, and video defocus. These algorithms are predominantly based on heatmaps, resulting in high computation and storage requirements per frame, which limits their flexibility and real-time application in video scenarios, particularly on edge devices. In this paper, we develop an efficient and effective video-based human pose regression method, which bypasses intermediate representations such as heatmaps and instead directly maps the input to the output joint coordinates. Despite the inherent spatial correlation among adjacent joints of the human pose, the temporal trajectory of each individual joint exhibits relative independence. In light of this, we propose a novel Decoupled Space-Time Aggregation network (DSTA) to separately capture the spatial contexts between adjacent joints and the temporal cues of each individual joint, thereby avoiding the conflation of spatiotemporal dimensions. Concretely, DSTA learns a dedicated feature token for each joint to facilitate the modeling of their spatiotemporal dependencies. With the proposed joint-wise local-awareness attention mechanism, our method is capable of efficiently and flexibly utilizing the spatial dependency of adjacent joints and the temporal dependency of each joint itself. Extensive experiments demonstrate the superiority of our method. Compared to previous regression-based single-frame human pose estimation methods, DSTA significantly enhances performance, achieving an 8.9 mAP improvement on PoseTrack2017. Furthermore, our approach either surpasses or is on par with the state-of-the-art heatmap-based multi-frame human pose estimation methods. Project page:
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper aims to make human pose regression in video sequences both efficient and accurate. Existing heatmap-based methods perform excellently in human pose estimation, but their high computational and storage requirements limit their flexibility in real-time video applications, especially on edge devices. Existing regression-based methods are more computationally efficient, but they are mainly designed for static images and ignore the temporal dependencies between video frames, so their performance degrades significantly on video input. The paper therefore proposes a new video-based human pose regression method, Decoupled Space-Time Aggregation (DSTA), to overcome these problems and improve multi-frame human pose estimation.

### Main Contributions

1. **DSTA Framework**: DSTA is a novel video-based human pose regression method that efficiently models the spatial and temporal dependencies of human joints in video sequences.
2. **First Regression-Based Multi-Frame Human Pose Estimation**: Compared with heatmap-based methods, DSTA is more efficient in computation and storage, which makes it suitable for real-time video applications, especially on edge devices.
3. **Experimental Verification**: Extensive experiments show that DSTA significantly outperforms previous regression-based methods and surpasses or is on par with the state-of-the-art heatmap-based multi-frame methods.

### Method Overview

The main modules of DSTA are:

- **Backbone**: Extracts global feature maps.
- **Joint-centric Feature Decoder (JFD)**: Extracts a feature embedding for each joint from the global feature map.
- **Space-Time Decoupling (STD)**: Models the spatial structural dependencies and the temporal dynamic dependencies of joints separately.

### Key Technologies

1. **Joint-centric Feature Decoder (JFD)**:
   - Extracts a feature embedding for each joint from the global feature map through convolutional layers or fully-connected layers.
   - The joint embeddings are written as \(\{F_j^i(t')\}_{j = 1}^n\), where \(F_j^i(t')\) is the feature embedding of the \(j\)-th joint at time \(t'\).
2. **Space-Time Decoupling (STD)**:
   - **Temporal Decoupling (TD)**: Captures the temporal dynamic dependencies of each joint through a local-aware attention mechanism.
     \[ \dot{F}_j^i(t)=\text{S-ATT}(\tilde{S}_j^i),\quad j = 1,2,\ldots,n \]
     where \(\tilde{S}_j^i=\langle F_j^i(t - T),\ldots,F_j^i(t),\ldots,F_j^i(t + T)\rangle\).
   - **Spatial Decoupling (SD)**: Captures the spatial structural dependencies of joints within the current frame through a local-aware attention mechanism.
     \[ \ddot{F}_j^i(t)=\text{S-ATT}(\langle F_j^i(t)\rangle_{j\in G(k)}),\quad k = 1,\ldots,K \]
     where \(G(k)\) is the index set of the \(k\)-th group of joints.
3. **Spatio-Temporal Aggregation**:
   - Fuses the temporal and spatial feature embeddings to produce the final spatio-temporal aggregated features.
     \[ f_j^i(t)=\dot{F}_j^i(t)\oplus\ddot{F}_j^i(t),\quad j = 1,2,\ldots,n \]
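
The decoupling described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes that S-ATT can be stood in for by plain scaled dot-product attention without learned projections, that \(\oplus\) denotes concatenation, and that the key frame is the centre of the window. The function names `attend` and `decoupled_aggregate` are hypothetical.

```python
import math


def attend(query, tokens):
    """Scaled dot-product attention of one query over a list of tokens.

    Assumption: no learned query/key/value projections, unlike a trained model.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, tok)) / math.sqrt(d) for tok in tokens]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w / total * tok[c] for w, tok in zip(weights, tokens)) for c in range(d)]


def decoupled_aggregate(feats, groups):
    """Fuse per-joint temporal and spatial context for the key frame.

    feats[t][j] is the embedding of joint j in frame t (2T+1 frames in total);
    groups is a list of joint-index lists, playing the role of G(1..K).
    """
    key = feats[len(feats) // 2]  # centre frame of the window (assumption)
    n_joints = len(key)

    # Temporal branch: joint j attends only to its own trajectory over time.
    temporal = [attend(key[j], [frame[j] for frame in feats]) for j in range(n_joints)]

    # Spatial branch: joint j attends only to joints of its own group in the key frame.
    # Joints that belong to no group simply keep their own embedding.
    spatial = [list(f) for f in key]
    for group in groups:
        tokens = [key[j] for j in group]
        for j in group:
            spatial[j] = attend(key[j], tokens)

    # Fusion: list concatenation stands in for the ⊕ operator (assumption).
    return [temporal[j] + spatial[j] for j in range(n_joints)]


# Toy input: 3 frames, 4 joints, 2-dimensional embeddings, two joint groups.
feats = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]],
    [[1.0, 0.0], [0.2, 0.8], [0.9, 1.1], [0.4, 0.6]],
    [[1.0, 0.0], [0.4, 0.6], [0.8, 1.2], [0.3, 0.7]],
]
groups = [[0, 1], [2, 3]]
fused = decoupled_aggregate(feats, groups)
```

Because each branch restricts which tokens a joint can attend to, changing a joint in one group leaves the fused features of joints in other groups untouched. That is the practical effect of not mixing the space and time dimensions in a single attention pass.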