Human-centric multimodal fusion network for robust action recognition
Zesheng Hu, Jian Xiao, Le Li, Cun Liu, Genlin Ji
DOI: https://doi.org/10.1016/j.eswa.2023.122314
IF: 8.5
2024-04-01
Expert Systems with Applications
Abstract: Skeleton-based methods have made remarkable strides in human action recognition (HAR). However, the performance of existing unimodal approaches is still limited by the lack of diverse visual features in skeleton data. Concretely, because skeleton data carry no information about interactions between individuals and objects, skeleton-based methods tend to confuse similar actions. Moreover, the view invariance of unimodal models is limited. In this work, we propose a skeleton-guided multimodal data fusion method that transforms the depth, RGB, and optical flow modalities into human-centric images (HCI) based on keypoint sequences. Building on this, we introduce a human-centric multimodal fusion network (HCMFN) that extracts the action patterns of the different modalities. Our model significantly improves on the performance of skeleton-based techniques while maintaining fast inference. Extensive experiments on two large-scale multimodal datasets, NTU RGB+D and NTU RGB+D 120, show that HCMFN improves the robustness of skeleton-based methods in two challenging HAR tasks: (1) discriminating between actions with subtle inter-class differences, and (2) recognizing actions from varying viewpoints. Compared with state-of-the-art multimodal methods, HCMFN achieves competitive results.
computer science, artificial intelligence; engineering, electrical & electronic; operations research & management science
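
The abstract describes turning depth, RGB, and optical flow frames into human-centric images guided by keypoint sequences, but it does not give the construction. The following is only a minimal sketch of one plausible reading: crop each frame to the padded bounding box of the 2D joints and resize the crop to a fixed size. The function name `human_centric_crop`, the `margin` and `out_size` parameters, and the nearest-neighbor resize are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def human_centric_crop(frame, keypoints, margin=0.1, out_size=(64, 64)):
    """Crop a frame around 2D keypoints and resize the crop.

    Hypothetical helper: frame is (H, W) or (H, W, C); keypoints is (J, 2)
    with columns (x, y) in pixel coordinates.
    """
    h, w = frame.shape[:2]
    xs, ys = keypoints[:, 0], keypoints[:, 1]

    # Bounding box of the joints, padded by a fraction of its extent.
    pad_x = margin * (xs.max() - xs.min())
    pad_y = margin * (ys.max() - ys.min())

    # Clip the box to the frame and keep it at least one pixel wide and tall.
    x0 = int(np.clip(np.floor(xs.min() - pad_x), 0, w - 1))
    x1 = int(np.clip(np.ceil(xs.max() + pad_x), x0 + 1, w))
    y0 = int(np.clip(np.floor(ys.min() - pad_y), 0, h - 1))
    y1 = int(np.clip(np.ceil(ys.max() + pad_y), y0 + 1, h))
    crop = frame[y0:y1, x0:x1]

    # Nearest-neighbor resize by sampling evenly spaced rows and columns.
    out_h, out_w = out_size
    rows = np.arange(out_h) * crop.shape[0] // out_h
    cols = np.arange(out_w) * crop.shape[1] // out_w
    return crop[rows][:, cols]


# Usage: the same keypoints could guide the crop for each aligned modality.
rgb = np.zeros((480, 640, 3), dtype=np.uint8)
depth = np.zeros((480, 640), dtype=np.uint16)
joints = np.array([[300.0, 100.0], [340.0, 100.0], [320.0, 300.0]])
rgb_hci = human_centric_crop(rgb, joints)
depth_hci = human_centric_crop(depth, joints)
```

Applying one keypoint-derived box to every modality assumes the modalities are already spatially aligned; if they are not, the keypoints would first need to be mapped into each sensor's coordinate frame.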