KTPFormer: Kinematics and Trajectory Prior Knowledge-Enhanced Transformer for 3D Human Pose Estimation

Jihua Peng, Yanghong Zhou, P.Y. Mok

2024-04-02

Abstract: This paper presents a novel Kinematics and Trajectory Prior Knowledge-Enhanced Transformer (KTPFormer), which overcomes a weakness of existing transformer-based methods for 3D human pose estimation: the Q, K, V vectors in their self-attention mechanisms are all derived by simple linear mapping. We propose two prior attention modules, namely Kinematics Prior Attention (KPA) and Trajectory Prior Attention (TPA), to take advantage of the known anatomical structure of the human body and motion trajectory information, and to facilitate effective learning of global dependencies and features in the multi-head self-attention. KPA models kinematic relationships in the human body by constructing a topology of kinematics, while TPA builds a trajectory topology to learn the information of joint motion trajectory across frames. By yielding Q, K, V vectors with prior knowledge, the two modules enable KTPFormer to model both spatial and temporal correlations simultaneously. Extensive experiments on three benchmarks (Human3.6M, MPI-INF-3DHP and HumanEva) show that KTPFormer achieves superior performance in comparison to state-of-the-art methods. More importantly, our KPA and TPA modules have lightweight plug-and-play designs and can be integrated into various transformer-based networks (e.g., diffusion-based ones) to improve performance with only a very small increase in computational overhead. The code is available at:

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses a limitation of existing Transformer-based methods for 3D human pose estimation: they generate the Q, K, V vectors of the self-attention mechanism by simple linear mappings alone. As a result, they have difficulty fully capturing the spatial correlations imposed by human anatomical structure and the temporal correlations of joint motion trajectories, which limits model performance. To solve this problem, the paper proposes a novel Kinematics and Trajectory Prior Knowledge-Enhanced Transformer (KTPFormer). It introduces two prior attention modules, Kinematics Prior Attention (KPA) and Trajectory Prior Attention (TPA), which use known information about human anatomy and motion trajectories to strengthen the learning of global dependencies and features in multi-head self-attention. KPA models the kinematic relationships between joints by constructing a human kinematic topology, while TPA learns joint motion trajectories across frames by constructing a trajectory topology. Together, these two modules enable KTPFormer to model spatial and temporal correlations simultaneously, and it outperforms existing state-of-the-art methods on three benchmark datasets (Human3.6M, MPI-INF-3DHP and HumanEva). In addition, the KPA and TPA modules are lightweight and plug-and-play: they can be integrated into various Transformer-based networks, improving performance with only a very small increase in computational overhead.
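The idea of feeding prior-enhanced tokens into the Q, K, V projections can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the 5-joint skeleton, the chain-shaped trajectory topology, the symmetric normalisation, and the choice to apply the prior by matrix multiplication before the linear maps are all assumptions made for illustration, and the function names are hypothetical. The actual KPA and TPA modules may build and combine their topologies differently.

```python
import numpy as np

def normalize_adjacency(adj):
    """Add self-loops and symmetrically normalise: D^-1/2 (A + I) D^-1/2."""
    a = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def chain_adjacency(n):
    """Adjacency of a chain: node i is linked to node i + 1."""
    adj = np.zeros((n, n))
    idx = np.arange(n - 1)
    adj[idx, idx + 1] = 1.0
    adj[idx + 1, idx] = 1.0
    return adj

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prior_attention(tokens, prior, w_q, w_k, w_v):
    """Single-head self-attention whose Q, K, V come from prior-mixed tokens.

    tokens: (N, C) features, prior: (N, N) topology, w_*: (C, C) weights.
    """
    mixed = prior @ tokens  # inject the topology before the linear maps (assumed)
    q, k, v = mixed @ w_q, mixed @ w_k, mixed @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
num_joints, num_frames, channels = 5, 4, 8

# Hypothetical toy skeleton: joint 0 is the root with two 2-bone limbs.
bones = [(0, 1), (1, 2), (0, 3), (3, 4)]
skeleton = np.zeros((num_joints, num_joints))
for i, j in bones:
    skeleton[i, j] = skeleton[j, i] = 1.0

# Spatial prior over joints (KPA-like) and temporal prior over frames (TPA-like).
kinematic_prior = normalize_adjacency(skeleton)
trajectory_prior = normalize_adjacency(chain_adjacency(num_frames))

w_q, w_k, w_v = (rng.standard_normal((channels, channels)) * 0.1 for _ in range(3))

joint_tokens = rng.standard_normal((num_joints, channels))  # all joints in one frame
frame_tokens = rng.standard_normal((num_frames, channels))  # one joint across frames

spatial_out = prior_attention(joint_tokens, kinematic_prior, w_q, w_k, w_v)
temporal_out = prior_attention(frame_tokens, trajectory_prior, w_q, w_k, w_v)
print(spatial_out.shape, temporal_out.shape)
```

With an identity matrix as the prior, `prior_attention` reduces to ordinary self-attention with purely linear Q, K, V mappings, which is the baseline behaviour the summary above describes as insufficient.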

EVOPOSE: A Recursive Transformer For 3D Human Pose Estimation With Kinematic Structure Priors

3D Human Pose Estimation with Spatial and Temporal Transformers

HSTFormer: Hierarchical Spatial-Temporal Transformers for 3D Human Pose Estimation

VTP: Volumetric Transformer for Multi-view Multi-person 3D Pose Estimation

Spatial-temporal-spectral Transformer for 3D Human Pose Estimation

Towards Precise 3D Human Pose Estimation with Multi-Perspective Spatial-Temporal Relational Transformers

DGFormer: Dynamic Graph Transformer for 3D Human Pose Estimation

CrossFormer: Cross Spatio-Temporal Transformer for 3D Human Pose Estimation

Graph and Skipped Transformer: Exploiting Spatial and Temporal Modeling Capacities for Efficient 3D Human Pose Estimation

Refined Temporal Pyramidal Compression-and-Amplification Transformer for 3D Human Pose Estimation

HDFormer: High-order Directed Transformer for 3D Human Pose Estimation

EMHIFormer: an Enhanced Multi-Hypothesis Interaction Transformer for 3D Human Pose Estimation in Video

STRFormer: Spatial–Temporal–ReTemporal Transformer for 3D Human Pose Estimation

ICRFormer: an Improving Cos-Reweighting Transformer for 3D Human Pose Estimation in Video

ConvFormer: Parameter Reduction in Transformer Models for 3D Human Pose Estimation by Leveraging Dynamic Multi-Headed Convolutional Attention

Cross-Space-Time 3D Human Body Pose Estimation Based on Transformer

ST²PE: Spatial and Temporal Transformer for Pose Estimation

DSTFormer: 3D Human Pose Estimation with a Dual-scale Spatial and Temporal Transformer Network

MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation