Abstract: The critical problem in skeleton-based action recognition is extracting high-level semantics from the dynamic changes between skeleton joints. Graph Convolutional Networks (GCNs) are therefore widely applied to capture the spatial-temporal information of dynamic joint coordinates through graph-based convolution. However, previous GCNs with fixed graph convolution kernels are limited by the static topology of the graph and cope poorly with the geometric variations of actions. Moreover, the local information of adjacent nodes is aggregated layer by layer, which increases model complexity. In this work, we propose a Deformable Graph Convolutional Transformer (DGT) for skeleton-based action recognition that extracts adaptive features through a flexible, learnable receptive field. DGT adopts a multiple-input-branches (MIB) architecture to obtain several kinds of information, such as joints, bones, and motions; these features are fused in the transformer classifier. Spatial-Temporal Graph Convolution (STGC) units then learn a preliminary feature representation that captures both spatial and temporal dependencies on the graph. Next, a deformable spatial-temporal compound attention backbone learns a robust representation from adaptive deformable skeleton features; the representation is adaptive because an offset-based convolution dynamically adjusts the receptive field. In addition, a self-attention-based transformer classifier (TC) encodes the sequence of features flattened over the spatial and temporal dimensions. Its fully connected attention mechanism further strengthens the high-level semantic representation by focusing on essential nodes in the graph. We evaluate DGT on two challenging large-scale datasets, NTU-RGBD 60 and NTU-RGBD 120. Experimental results show that DGT adaptively optimizes the attention paid to different joints. It achieves performance comparable to the state of the art while being much more efficient, which demonstrates the effectiveness of the proposed method.
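The abstract names joints, bones, and motions as the inputs of the MIB architecture but does not define them. The sketch below shows one common way such streams are derived from raw joint coordinates, assuming that a bone is the vector from a joint's parent to the joint and that motion is the displacement of a joint between consecutive frames. The function names, the `parents` list, and the toy three-joint skeleton are hypothetical illustrations, not taken from the paper.

```python
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]
Frame = List[Coord]  # one 3D coordinate per joint


def bone_stream(frames: Sequence[Frame], parents: Sequence[int]) -> List[Frame]:
    """Bone vector per joint: joint position minus its parent's position.

    Assumption: the root joint is listed as its own parent, so its bone is zero.
    """
    return [
        [
            tuple(c - p for c, p in zip(frame[j], frame[parents[j]]))
            for j in range(len(frame))
        ]
        for frame in frames
    ]


def motion_stream(frames: Sequence[Frame]) -> List[Frame]:
    """Displacement of each joint to the next frame; the last frame is zero-padded."""
    out: List[Frame] = []
    for t, frame in enumerate(frames):
        if t + 1 < len(frames):
            nxt = frames[t + 1]
            out.append(
                [tuple(b - a for a, b in zip(frame[j], nxt[j])) for j in range(len(frame))]
            )
        else:
            out.append([(0.0, 0.0, 0.0) for _ in frame])
    return out


# Toy skeleton: a chain of 3 joints (0 is the root, 1 hangs off 0, 2 hangs off 1).
parents = [0, 0, 1]
frames = [
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 2.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 2.0, 0.0)],
]

joints = frames                        # joint stream: the raw coordinates
bones = bone_stream(frames, parents)   # bone stream
motion = motion_stream(frames)         # motion stream
```

All three streams keep the same frames-by-joints-by-3 layout, so in a multi-branch design each one could be passed to its own input branch before the features are fused.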

DGT: Dynamic Graph Transformer for Enhanced Processing of Dynamic Joint Sequences in 2D Human Pose Estimation

Graph and Skipped Transformer: Exploiting Spatial and Temporal Modeling Capacities for Efficient 3D Human Pose Estimation

DSTFormer: 3D Human Pose Estimation with a Dual-scale Spatial and Temporal Transformer Network

Learning Dynamical Human-Joint Affinity for 3D Pose Estimation in Videos

3D Human Pose Estimation with Spatial and Temporal Transformers

Double-chain Constraints for 3D Human Pose Estimation in Images and Videos

3D human pose estimation with multi-hypotheses gated transformer

Joint graph convolution networks and transformer for human pose estimation in sports technique analysis

Towards Precise 3D Human Pose Estimation with Multi-Perspective Spatial-Temporal Relational Transformers

Deformable graph convolutional transformer for skeleton-based action recognition

SPGformer: Serial–Parallel Hybrid GCN-Transformer With Graph-Oriented Encoder for 2-D-to-3-D Human Pose Estimation

KTPFormer: Kinematics and Trajectory Prior Knowledge-Enhanced Transformer for 3D Human Pose Estimation

ST²PE: Spatial and Temporal Transformer for Pose Estimation

Multi-hop graph transformer network for 3D human pose estimation

A Spatio-Temporal Transformer Network for Human Motion Prediction in Human-Robot Collaboration

Graph-aware transformer for skeleton-based action recognition

GATOR: Graph-Aware Transformer with Motion-Disentangled Regression for Human Mesh Recovery from a 2D Pose

PoseFormerV2: Exploring Frequency Domain for Efficient and Robust 3D Human Pose Estimation

Human Pose Estimation Via Dynamic Information Transfer

GITPose: going shallow and deeper using vision transformers for human pose estimation