Abstract: Skeleton-based action recognition has attracted growing attention because it is robust to environmental noise such as lighting changes and viewpoint variation. Most current approaches interleave spatial-only modules and temporal-only modules, where messages are first aggregated spatially between human joints and then delivered temporally. Although efficient, this factorized design hinders direct information flow among joints in adjacent frames, and thus (1) fails to convey information effectively across a larger neighborhood and (2) is less able to capture the short-term motion that distinguishes similar action pairs. To address these limitations, we propose the spatial-temporal graph attention network (STGAT) to dynamically model local cross-spacetime information. Keeping the temporal-only modules unchanged, it inflates the spatial-only modules to perform spatial-temporal modeling over the local spatial-temporal neighborhood without a second transmission cost. In particular, unlike previous spatial-temporal modeling methods, graph edges are computed adaptively rather than taken from predefined fixed connectivity, so that beneficial information is aggregated dynamically. This allows STGAT to attend to important joints in a local spatial-temporal neighborhood and capture critical short-term movements. While STGAT is theoretically effective at introducing local spatial-temporal information by covering a large spatial-temporal neighborhood, we find that its effectiveness is impeded by the inherent redundancy in local features. We propose three simple modules that reduce local feature redundancy, calibrate the feature modeling schema, and further release the potential of STGAT. They (1) narrow the scope of self-attention operators, (2) dynamically weight joints along the temporal dimension, and (3) separate subtle motion from static features, respectively.
As a result, STGAT achieves state-of-the-art performance on three large-scale datasets: NTU RGB+D 60, NTU RGB+D 120, and Kinetics Skeleton 400. Thanks to its focus on local cross-spacetime information, STGAT generalizes better than previous methods when classifying hard, similar actions. Visualizations show that STGAT attends to the critical joints when handling different actions. Code is available at https://github.com/hulianyuyy/STGAT.
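
The core idea described in the abstract, attention over joints in a local spatial-temporal neighborhood with adaptively computed edges, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). The function name `local_st_attention`, the projection matrices `w_q`, `w_k`, `w_v`, the `window` parameter, and all tensor sizes are assumptions made for illustration, and the three redundancy-reduction modules are not modeled.

```python
import numpy as np


def local_st_attention(x, w_q, w_k, w_v, window=1):
    """Hypothetical sketch: each joint in frame t attends to all joints
    in frames [t - window, t + window] (clipped at the sequence ends).

    x:   (T, V, C) features for T frames, V joints, C channels
    w_*: (C, D) projection matrices (assumed, not from the paper)
    Returns an array of shape (T, V, D).
    """
    T, V, _ = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # each (T, V, D)
    d = q.shape[-1]
    out = np.zeros_like(q)
    for t in range(T):
        lo, hi = max(0, t - window), min(T, t + window + 1)
        k_loc = k[lo:hi].reshape(-1, d)           # (N, D), N = (hi - lo) * V
        v_loc = v[lo:hi].reshape(-1, d)           # (N, D)
        # Edge weights are computed from the data rather than taken
        # from a fixed skeleton adjacency matrix.
        scores = q[t] @ k_loc.T / np.sqrt(d)      # (V, N)
        scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over N
        out[t] = weights @ v_loc                  # (V, D)
    return out


# Illustrative sizes only: 8 frames, 25 joints, 16 input and 32 output channels.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 25, 16))
w_q, w_k, w_v = (rng.standard_normal((16, 32)) for _ in range(3))
y = local_st_attention(x, w_q, w_k, w_v, window=1)
print(y.shape)
```

In this sketch, setting `window=0` reduces the operation to per-frame spatial attention, which is the sense in which a spatial-only module is "inflated" to a spatial-temporal one. Information from adjacent frames then reaches a joint in a single aggregation step instead of a spatial step followed by a temporal step.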

Glimpse and Zoom: Spatio-Temporal Focused Dynamic Network for Skeleton-based Action Recognition

SpatioTemporal Focus for Skeleton-based Action Recognition

Spatial Temporal Graph Attention Network for Skeleton-Based Action Recognition

Skeleton-based action recognition with local dynamic spatial-temporal aggregation

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

A Spatiotemporal Fusion Network for Skeleton-Based Action Recognition

Dynamic spatial-temporal topology graph network for skeleton-based action recognition

STACE-GCN: A Spatio-Temporal-Aware Channel Excited Graph Convolutional Network for Skeleton-based Action Recognition.

Dynamic Spatial-temporal Hypergraph Convolutional Network for Skeleton-based Action Recognition

Temporal Refinement Graph Convolutional Network for Skeleton-based Action Recognition

An improved spatial temporal graph convolutional network for robust skeleton-based action recognition

Spatio-Temporal Inception Graph Convolutional Networks for Skeleton-Based Action Recognition.

Dynamic Semantic-Based Spatial-Temporal Graph Convolution Network for Skeleton-Based Human Action Recognition

Multi‐temporal scale aggregation refinement graph convolutional network for skeleton‐based action recognition

TSGCNeXt: Dynamic-Static Multi-Graph Convolution for Efficient Skeleton-Based Action Recognition with Long-term Learning Potential

Densely Connected and Multiple Temporal Graph Convolution Networks for Skeleton-based Action Recognition

Multilevel Spatial-Temporal Excited Graph Network for Skeleton-Based Action Recognition

Spatial-Temporal Adaptive Graph Convolutional Network for Skeleton-Based Action Recognition.

Human Skeleton Feature Optimizer and Adaptive Structure Enhancement Graph Convolution Network for Action Recognition

A Tri-Attention Enhanced Graph Convolutional Network for Skeleton-Based Action Recognition

Spatial‐temporal Slowfast Graph Convolutional Network for Skeleton‐based Action Recognition