Keypoints-based Multimodal Network for Robust Human Action Recognition

Zesheng Hu, Genlin Ji, Jiaquan Gao, Bin Zhao, Xichen Yang
DOI: https://doi.org/10.1145/3686397.3686410
2024-01-01
Abstract: Skeleton-based action recognition has garnered widespread attention. However, due to the inherent limitations of skeleton sequences, existing methods often confuse actions that share inter-class similarities and struggle to achieve viewpoint invariance. As a solution, multimodal action recognition leverages the complementary information across modalities to significantly enhance the performance of unimodal models. Yet effectively integrating these modalities remains an open problem. In this work, we first propose a keypoints-based multimodal data fusion method that constructs images representing the crucial spatiotemporal characteristics of actions and how those characteristics vary. Building on this, we introduce the keypoints-based multimodal fusion network (KBMN), which comprehensively learns action features from skeleton, RGB, and depth data. Extensive experiments on two large-scale datasets demonstrate that KBMN performs robustly in both unimodal and multimodal action recognition tasks. Used as an auxiliary model for skeleton-based methods, KBMN also helps various baseline methods improve their recognition accuracy.