TCPFormer: Learning Temporal Correlation with Implicit Pose Proxy for 3D Human Pose Estimation

Jiajie Liu, Mengyuan Liu, Hong Liu, Wenhao Li
2025-01-03
Abstract: Recent multi-frame lifting methods have dominated the 3D human pose estimation. However, previous methods ignore the intricate dependence within the 2D pose sequence and learn single temporal correlation. To alleviate this limitation, we propose TCPFormer, which leverages an implicit pose proxy as an intermediate representation. Each proxy within the implicit pose proxy can build one temporal correlation therefore helping us learn more comprehensive temporal correlation of human motion. Specifically, our method consists of three key components: Proxy Update Module (PUM), Proxy Invocation Module (PIM), and Proxy Attention Module (PAM). PUM first uses pose features to update the implicit pose proxy, enabling it to store representative information from the pose sequence. PIM then invocates and integrates the pose proxy with the pose sequence to enhance the motion semantics of each pose. Finally, PAM leverages the above mapping between the pose sequence and pose proxy to enhance the temporal correlation of the whole pose sequence. Experiments on the Human3.6M and MPI-INF-3DHP datasets demonstrate that our proposed TCPFormer outperforms the previous state-of-the-art methods.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem that this paper attempts to solve is: **Existing 3D human pose estimation methods model the complex temporal correlations in 2D pose sequences inadequately, so their performance gains gradually saturate**. Specifically, existing multi-frame methods establish only a single 1-to-T mapping between each pose and the pose sequence, which cannot fully reflect the complex temporal correspondence within the sequence.

### Problem Background

3D human pose estimation is an important problem in computer vision that aims to locate the 3D positions of human joints. Although deep-learning-based methods have made remarkable progress in recent years, they still suffer from depth ambiguity. To alleviate this problem, many studies exploit temporal information across adjacent frames, using temporal convolution, graph convolution, and Transformer architectures to capture long-term temporal dependencies. However, as the number of input frames increases, the performance gains shrink. For example, when PoseFormerV2 expands its input from 81 frames to 243 frames, the error is reduced by only 0.8 mm; when StridedTrans expands its input from 243 frames to 351 frames, the error is reduced by only 0.3 mm. This indicates that most methods are limited in how effectively they model the temporal correlations in 2D pose sequences.

### Method Proposed in the Paper

To address these problems, the authors propose TCPFormer (Temporal Correlation with Implicit Pose Proxy), which models the temporal correlations in the pose sequence more comprehensively by introducing an implicit pose proxy as an intermediate representation. Specifically:

- **Implicit Pose Proxy**: Each proxy can establish its own 1-to-T mapping, so a set of proxies helps the model learn more comprehensive temporal correlations.
- **Three Key Modules**:
  - **Proxy Update Module (PUM)**: Uses cross-attention to update the pose proxy so that it stores representative information from the pose sequence.
  - **Proxy Invocation Module (PIM)**: Uses the updated pose proxy to enhance the feature representation of the pose sequence.
  - **Proxy Attention Module (PAM)**: Combines the two cross-attention matrices generated by PUM and PIM into an aggregated attention matrix and fuses it with the original self-attention matrix to learn more effective temporal correlations.

### Experimental Results

The experimental results show that TCPFormer outperforms existing methods on both the Human3.6M and MPI-INF-3DHP benchmark datasets. In particular, on MPI-INF-3DHP, TCPFormer with only 9 input frames exceeds the results of other methods that use 81-frame inputs.

### Summary

By introducing the implicit pose proxy and its three modules, TCPFormer addresses the shortcomings of existing methods in modeling complex temporal correlations and improves 3D human pose estimation performance on both benchmarks.
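To make the data flow through the three modules concrete, here is a minimal NumPy sketch of the idea described under the method summary. It is not the authors' implementation. The following are assumptions made for illustration: the function name `proxy_attention_sketch`, the tensor sizes, the absence of learned projections, heads, and normalization layers, the use of a matrix product to aggregate the two cross-attention matrices, and the scalar fusion weight `alpha`.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_matrix(queries, keys):
    """Scaled dot-product attention weights, shape (len(queries), len(keys))."""
    d = queries.shape[-1]
    return softmax(queries @ keys.T / np.sqrt(d), axis=-1)


def proxy_attention_sketch(pose, proxy, alpha=0.5):
    """Illustrative PUM -> PIM -> PAM flow (hypothetical, no learned weights).

    pose:  (T, C) per-frame pose features
    proxy: (N, C) implicit pose proxy with N proxies
    alpha: assumed scalar weight for fusing self- and proxy-based attention
    """
    # PUM-like step: proxies attend to the pose sequence and absorb it.
    a_update = attention_matrix(proxy, pose)            # (N, T)
    proxy = proxy + a_update @ pose                     # (N, C)

    # PIM-like step: each pose attends to the updated proxies.
    a_invoke = attention_matrix(pose, proxy)            # (T, N)
    pose = pose + a_invoke @ proxy                      # (T, C)

    # PAM-like step: chain the two cross-attention matrices into a T x T
    # matrix (assumed aggregation), then fuse with ordinary self-attention.
    a_proxy = a_invoke @ a_update                       # (T, T)
    a_self = attention_matrix(pose, pose)               # (T, T)
    a_fused = alpha * a_self + (1.0 - alpha) * a_proxy  # (T, T)

    return {
        "A_update": a_update,
        "A_invoke": a_invoke,
        "A_fused": a_fused,
        "output": a_fused @ pose,                       # (T, C)
    }


rng = np.random.default_rng(0)
T, N, C = 9, 4, 16  # frames, proxies, channels: arbitrary demo sizes
result = proxy_attention_sketch(rng.standard_normal((T, C)),
                                rng.standard_normal((N, C)))
print(result["A_fused"].shape, result["output"].shape)
```

In this sketch, each row of the proxy-based T x T matrix sums attention over N intermediate proxies, which is one way to read the summary's point that multiple proxies yield more than a single 1-to-T mapping. Because the product of two row-stochastic matrices is row-stochastic, the fused matrix still has rows that sum to 1 for any `alpha` in [0, 1].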