Spatial-Temporal Synchronous Transformer for Skeleton-Based Hand Gesture Recognition

Dongdong Zhao, Hongli Li, Shi Yan
DOI: https://doi.org/10.1109/tcsvt.2023.3295084
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Efficiently capturing the long-range spatial-temporal correlation among joints in dynamic skeletal data is very challenging in hand gesture recognition (HGR). The flexibility of the Transformer in modeling global dependencies among the elements of any sequence makes it well suited to skeleton-based HGR. However, existing Transformer-based approaches capture the correlation of intra-frame joints and of inter-frame joints separately, without considering the relationship among different joints across several successive frames. In this paper, a novel spatial-temporal synchronous transformer (STST) method is proposed for skeleton-based HGR. A spatial-temporal chunks encoding module encodes the hand gesture skeleton sequence (HGSS) into several chunks, each containing several consecutive frames, so that the relationship among spatial-temporal joints is encoded. The encoded features are then fed into a spatial-temporal chunks transformer module and a temporal integration transformer module, which model the spatial-temporal correlation of HGSSs simultaneously, giving a more comprehensive understanding of global and local spatial-temporal information. In this way, the spatial-temporal information among joints can be efficiently extracted and used to better understand the semantics of gesture actions, yielding higher recognition accuracy. Extensive experiments on the SHREC’17 Track and DHG-14/28 datasets show that the proposed method achieves state-of-the-art performance compared with other representative methods.
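The chunk idea in the abstract can be pictured with a minimal sketch. It is not the authors' implementation. The abstract does not give the chunk length, whether chunks overlap, or how joints are embedded, so the function name `split_into_chunks`, the parameter `frames_per_chunk`, the non-overlapping split, the dropping of leftover frames, and the toy sizes (8 frames, 22 joints, 3 coordinates) are all assumptions made for illustration.

```python
from typing import List

Joint = List[float]      # coordinates of one joint, e.g. [x, y, z]
Frame = List[Joint]      # all joints of the hand in one frame
Skeleton = List[Frame]   # all frames of one gesture sequence


def split_into_chunks(skeleton: Skeleton, frames_per_chunk: int) -> List[List[Joint]]:
    """Group consecutive frames into chunks and flatten each chunk into one
    list of joint tokens (frame-major order).

    Assumptions (not from the paper): chunks do not overlap, and trailing
    frames that do not fill a whole chunk are dropped.
    """
    if frames_per_chunk <= 0:
        raise ValueError("frames_per_chunk must be positive")
    usable = len(skeleton) - len(skeleton) % frames_per_chunk
    chunks = []
    for start in range(0, usable, frames_per_chunk):
        window = skeleton[start:start + frames_per_chunk]
        # Every joint of every frame in the window becomes one token, so a
        # token sequence mixes joints from different frames.
        chunks.append([joint for frame in window for joint in frame])
    return chunks


# Toy input: 8 frames, 22 joints per frame, 3 coordinates per joint.
toy = [[[float(t), float(j), 0.0] for j in range(22)] for t in range(8)]
chunks = split_into_chunks(toy, frames_per_chunk=4)
print(len(chunks), len(chunks[0]))  # prints "2 88"
```

Because each chunk holds the joints of several consecutive frames in one token sequence, self-attention applied within a chunk can relate different joints in different frames directly, which is the kind of relationship the abstract says is missed when intra-frame and inter-frame correlations are modeled separately.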