Spatial-Temporal Unfold Transformer for Skeleton-based Human Action Recognition

Hu Cui, Tessai Hayama
DOI: https://doi.org/10.52731/liir.v004.167
2023-01-01
Abstract: Transformer-based architectures have proven effective for action and gesture recognition. In contrast to Graph Convolutional Networks (GCNs), they can automatically model joint relationships through attention mechanisms without any predefined topological graph. However, most previous approaches apply attention to the spatial and temporal dimensions in a completely decoupled manner, ignoring the local dynamic features of the action and human body semantics, and their performance lags behind state-of-the-art GCN-based methods. To overcome these issues, we propose a Spatial-Temporal Unfold Attention Network (STUT). First, it locally unfolds skeleton data in the temporal dimension such that all neighboring frames are included in each unfolded frame. Then, the human body structural semantics of actions are extracted by a hypergraph convolution and used to guide the local spatio-temporal attention operation in each unfolded frame. In addition, to distinguish the importance of different frames, we introduce temporal squeezing attention (TSE) for multi-scale global spatial-temporal modeling. Extensive experiments show that our model achieves 96.4% on NW-UCLA and 96.91% / 94.88% on SHREC17 (14-gestures / 28-gestures).
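The temporal unfolding step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the window size, the border handling, or the tensor layout, so the function name `temporal_unfold`, the `radius` parameter, and the choice to clamp indices at the sequence borders are all assumptions made for illustration.

```python
from typing import List, Sequence

Joint = Sequence[float]   # e.g. (x, y, z) coordinates of one joint
Frame = Sequence[Joint]   # V joints per frame


def temporal_unfold(frames: Sequence[Frame], radius: int = 1) -> List[List[Joint]]:
    """Return one unfolded frame per input frame (hypothetical helper).

    Unfolded frame t concatenates the joints of frames t-radius .. t+radius,
    so it holds (2 * radius + 1) * V joints. Indices outside the sequence are
    clamped to the nearest valid frame (an assumed padding choice).
    """
    if radius < 0:
        raise ValueError("radius must be non-negative")
    last = len(frames) - 1
    unfolded: List[List[Joint]] = []
    for t in range(len(frames)):
        window: List[Joint] = []
        for offset in range(-radius, radius + 1):
            src = min(max(t + offset, 0), last)  # clamp at sequence borders
            window.extend(frames[src])
        unfolded.append(window)
    return unfolded
```

With three frames of two joints each and `radius=1`, every unfolded frame contains six joints, so a local attention operation applied within one unfolded frame would see a joint together with its temporal neighbors rather than a single time step.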