Abstract: The critical problem in skeleton-based action recognition is extracting high-level semantics from the dynamic changes between skeleton joints. Graph Convolutional Networks (GCNs) are therefore widely applied to capture the spatial-temporal information of dynamic joint coordinates through graph-based convolution. However, previous GCNs with fixed graph convolution kernels are limited by the static topology of the graph and adapt poorly to the geometric variations of actions. Moreover, they aggregate the local information of adjacent graph nodes layer by layer, which increases model complexity. In this work, we propose a Deformable Graph Convolutional Transformer (DGT) for skeleton-based action recognition that extracts adaptive features through a flexible, learnable receptive field. DGT adopts a multiple-input-branches (MIB) architecture to obtain several kinds of input information, such as joints, bones, and motions; these features are fused in the Transformer Classifier. Spatial-Temporal Graph Convolution (STGC) units then learn a preliminary feature representation that captures both spatial and temporal dependencies on the graph. Next, a deformable spatial-temporal compound attention backbone learns a robust representation from adaptive deformable skeleton features; the representation is adaptive because an offset-based convolution dynamically adjusts the receptive field. In addition, a self-attention-based transformer classifier (TC) encodes the sequence of features flattened along the spatial and temporal dimensions. Its fully connected attention mechanism further strengthens the high-level semantic representation by focusing on the essential nodes in the graph. We evaluate DGT on two challenging large-scale datasets, NTU-RGBD 60 and NTU-RGBD 120. The experimental results indicate that DGT adaptively optimizes the attention paid to different joints, and that it achieves performance comparable to the state of the art while being much more efficient.
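To make the joint, bone, and motion inputs of the MIB architecture concrete, the following is a minimal sketch in plain Python. The abstract does not define these inputs, so the code assumes the convention common in multi-stream skeleton models: a bone is a joint's coordinate minus its parent joint's coordinate, and motion is the difference between consecutive frames. The function names, the `parents` list, and the three-joint toy skeleton are hypothetical and are not taken from the DGT implementation.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Frame = List[Point]  # one 3D coordinate per joint


def bone_stream(frames: Sequence[Frame], parents: Sequence[int]) -> List[Frame]:
    """Assumed bone definition: each joint minus its parent joint.

    The root joint lists itself as parent, so its bone is the zero vector.
    """
    return [
        [
            tuple(c - p for c, p in zip(frame[j], frame[parents[j]]))
            for j in range(len(frame))
        ]
        for frame in frames
    ]


def motion_stream(frames: Sequence[Frame]) -> List[Frame]:
    """Assumed motion definition: frame t+1 minus frame t.

    The last frame has no successor and is padded with zeros.
    """
    out: List[Frame] = []
    for t, cur in enumerate(frames):
        if t + 1 < len(frames):
            nxt = frames[t + 1]
            out.append([tuple(b - a for a, b in zip(p, q)) for p, q in zip(cur, nxt)])
        else:
            out.append([(0.0, 0.0, 0.0) for _ in cur])
    return out


# Toy skeleton (hypothetical): 3 joints in a chain, joint 0 is the root.
parents = [0, 0, 1]
frames: List[Frame] = [
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 2.0, 0.0)],
    [(1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (1.0, 3.0, 0.0)],
]

joints = frames                      # branch 1: raw joint coordinates
bones = bone_stream(frames, parents)  # branch 2: bone vectors
motion = motion_stream(frames)        # branch 3: frame-to-frame motion
```

Each of the three lists has the same frames-by-joints-by-3 layout, so in a model of this kind each could feed its own input branch before the branches are fused.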