Multi-Modal Transformer with Skeleton and Text for Action Recognition

Xuri Jiao, Lijuan Zhou
DOI: https://doi.org/10.1109/IJCNN60899.2024.10650141
2024-06-30
Abstract: Dynamic skeleton data, represented as the 2D/3D coordinates of human joints, has been widely used for human action recognition due to its high-level semantic information and environmental robustness. However, previous methods mostly use skeleton data alone, without considering the crucial role of text information in helping machines understand visual content. This paper proposes a novel method for action recognition based on a multi-modal Transformer with skeleton and text, named MMT-ST. The proposed method performs action captioning and recognition simultaneously and dynamically updates action recognition based on the results of action captioning. MMT-ST employs a Transformer as the backbone and consists of four components: two single-modal encoders, a cross encoder, and a decoder. The single-modal encoders embed skeletons and texts, respectively. The cross encoder learns the underlying correlations between the two modalities and performs the action recognition task through a classification head. The decoder conducts the action captioning task. Additionally, a two-stage training strategy is employed to ensure smoother model training. Extensive experiments on the NTU RGB+D, NTU RGB+D 120, and ETRI-Activity 3D datasets demonstrate the effectiveness of the proposed method.
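The four components named in the abstract can be pictured with a small data-flow sketch. This is only an illustration under stated assumptions, not the authors' implementation: the class name ToyMMTST, the model width D, the random weights, the single unparameterized attention step standing in for each Transformer block, and the choice to fuse modalities by attending over the concatenated token sequence are all hypothetical. It also does not model how captioning results feed back into recognition, nor the two-stage training strategy, since the abstract does not describe them in detail.

```python
import math
import random

random.seed(0)
D = 8  # hypothetical model width, chosen only for illustration


def linear(n_in, n_out):
    """Random weight matrix with n_out rows and n_in columns."""
    return [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def apply(w, x):
    """Multiply weight matrix w by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, memory):
    """Single-head scaled dot-product attention with no learned projections."""
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in memory]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, memory)) for i in range(D)])
    return out


class ToyMMTST:
    """Hypothetical stand-in for the encoder / cross-encoder / decoder layout."""

    def __init__(self, joint_dim, vocab_size, num_classes):
        self.skel_proj = linear(joint_dim, D)   # skeleton encoder (one projection)
        self.text_emb = linear(D, vocab_size)   # text encoder (embedding table)
        self.cls_head = linear(D, num_classes)  # classification head
        self.out_proj = linear(D, vocab_size)   # decoder output projection

    def forward(self, frames, token_ids):
        # Single-modal encoders: embed each modality separately.
        skel = [apply(self.skel_proj, f) for f in frames]
        text = [self.text_emb[t] for t in token_ids]
        # Cross encoder (assumed form): attention over the joint token sequence.
        tokens = skel + text
        fused = attend(tokens, tokens)
        # Recognition: mean-pool the fused tokens, then classify.
        pooled = [sum(tok[i] for tok in fused) / len(fused) for i in range(D)]
        class_logits = apply(self.cls_head, pooled)
        # Captioning: text positions attend to the fused memory.
        decoded = attend(text, fused)
        caption_logits = [apply(self.out_proj, h) for h in decoded]
        return class_logits, caption_logits
```

For example, calling `ToyMMTST(joint_dim=75, vocab_size=20, num_classes=60).forward(frames, [1, 5, 7])` with four 75-dimensional frames returns one score per action class and one row of vocabulary scores per caption token; the sizes here are arbitrary placeholders.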
Computer Science