Spatial-temporal Transformer for Skeleton-based Action Recognition

Qipeng Zhang, Tian Wang, Mengyi Zhang, Kexin Liu, Peng Shi, Hichem Snoussi
DOI: https://doi.org/10.1109/cac53003.2021.9728206
2021-01-01
Abstract: In skeleton-based human action recognition, graph convolutional networks (GCNs) have achieved good results in previous research thanks to their ability to model graph-structured data. Recently, transformers have achieved extraordinary results in many computer vision fields. Comparing the two, a transformer can be regarded as a kind of dynamic GCN in which the weight of each node is determined dynamically by the data. In this work, we propose a three-dimensional position encoding to represent the spatial information of nodes, so that the transformer can be applied to graph data. In addition, similar to Spatial-Temporal Graph Convolutional Networks (ST-GCN), we propose a Space-Time Transformer (ST-TR), which applies transformers in space and time to extract spatiotemporal features from skeleton data for action recognition.
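The "dynamic GCN" reading of a transformer can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes identity query, key, and value projections (no learned weights), a toy three-joint skeleton, and it omits the three-dimensional position encoding, whose exact form the abstract does not specify. The names `gcn_aggregate` and `attention_adjacency` are hypothetical helpers introduced only for this illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def gcn_aggregate(adjacency, features):
    """Graph aggregation: out[i] = sum_j adjacency[i][j] * features[j]."""
    dim = len(features[0])
    return [
        [sum(w * x[d] for w, x in zip(row, features)) for d in range(dim)]
        for row in adjacency
    ]


def attention_adjacency(features):
    """Data-dependent edge weights: row-wise softmax of scaled dot products.

    Assumption: queries and keys are the raw features (identity projections).
    """
    dim = len(features[0])
    scores = [
        [sum(a * b for a, b in zip(xi, xj)) / math.sqrt(dim) for xj in features]
        for xi in features
    ]
    return [softmax(row) for row in scores]


# Toy skeleton: three joints with 2-D features (illustrative values only).
joints = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# GCN view: edge weights are fixed by the skeleton (chain with self-loops,
# row-normalized) and do not depend on the joint features.
fixed_adjacency = [
    [0.5, 0.5, 0.0],
    [1 / 3, 1 / 3, 1 / 3],
    [0.0, 0.5, 0.5],
]
fixed_out = gcn_aggregate(fixed_adjacency, joints)

# Transformer view: the same aggregation, but the edge weights are computed
# from the features themselves and connect every pair of joints.
dynamic_adjacency = attention_adjacency(joints)
dynamic_out = gcn_aggregate(dynamic_adjacency, joints)
```

In this sketch both views share the same aggregation step; the only difference is where the weight matrix comes from, which is the sense in which attention weights act as a dense adjacency matrix determined by the data.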