When Skeleton Meets Motion: Adaptive Multimodal Graph Representation Fusion for Action Recognition

Xiao Liu, Guan Yuan, Rui Bing, Zhuo Cai, Shengshen Fu, Yonghao Yu
DOI: https://doi.org/10.1109/icme57554.2024.10688272
2024-01-01
Abstract: Multimodal action recognition uses complementary information from multiple data modalities to identify human behaviors and has achieved remarkable results. However, existing multimodal fusion methods often overlook differences in contribution among intra-joints and inter-joints, which limits their ability to discern ambiguous actions, that is, different actions with similar sequences. In addition, the modality gap hinders graph convolutional networks from extracting correlation information between multimodal action data. To address these problems, we propose a multimodal action recognition method based on an Adaptive Multimodal Graph Representation Fusion (AMGRF) model. First, we jointly use skeleton data and wearable sensor data to depict human actions, and construct a heterogeneous graph derived from the skeleton graph and the sensor graph to mine inter-modal correlation information. Second, we design an adaptive multimodal graph representation fusion module that performs node-level action feature fusion among intra-joints and inter-joints via Gumbel-Softmax. Finally, extensive experiments on three public datasets (CZU-MHAD, UTD-MHAD, and Berkeley-MHAD) substantiate the superiority of AMGRF over state-of-the-art methods.
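The abstract states only that node-level fusion is performed via Gumbel-Softmax and gives no further detail. As a rough illustration of that general technique, and not the authors' implementation, the sketch below draws relaxed one-hot weights with the standard Gumbel-Softmax trick and uses them to mix two per-node feature vectors. The function names `gumbel_softmax` and `fuse_node`, the two-way skeleton/sensor weighting, and the temperature parameter `tau` are all assumptions introduced here for illustration.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Draw one relaxed one-hot sample from the Gumbel-Softmax distribution."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_node(skeleton_feat, sensor_feat, logits, tau=1.0, rng=random):
    """Hypothetical node-level fusion: weight two feature vectors and sum them.

    `logits` holds two scores (skeleton, sensor); in a trained model these
    would be learned, which is an assumption and not stated in the abstract.
    """
    w_skel, w_sens = gumbel_softmax(logits, tau, rng)
    return [w_skel * a + w_sens * b for a, b in zip(skeleton_feat, sensor_feat)]


if __name__ == "__main__":
    rng = random.Random(0)
    print(gumbel_softmax([0.5, -0.5], tau=0.5, rng=rng))
    print(fuse_node([1.0, 1.0], [3.0, 3.0], [0.0, 0.0], tau=1.0, rng=rng))
```

A lower `tau` pushes the sampled weights toward a hard one-hot choice between the two inputs, while a higher `tau` yields a smoother blend; how AMGRF sets or schedules this temperature is not described in the abstract.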