Abstract: Skeleton-based action recognition has attracted significant attention because skeletons are concise and robust. However, skeletons lack detailed body information, which limits performance, while other multimodal methods require substantial inference resources and are inefficient because they use multimodal data in both the training and inference stages. To address this and fully exploit complementary multimodal features, we propose a novel multi-modality co-learning (MMCL) framework that leverages multimodal large language models (LLMs) as auxiliary networks for efficient skeleton-based action recognition. The framework performs multi-modality co-learning during training and stays efficient by using only concise skeletons at inference. Our MMCL framework primarily consists of two modules. First, the Feature Alignment Module (FAM) extracts rich RGB features from video frames and aligns them with global skeleton features via contrastive learning. Second, the Feature Refinement Module (FRM) uses RGB images with temporal information and text instructions to generate instructive features, building on the strong generalization of multimodal LLMs. These instructive text features further refine the classification scores, and the refined scores enhance the model's robustness and generalization in a manner similar to soft labels. Extensive experiments on the NTU RGB+D, NTU RGB+D 120 and Northwestern-UCLA benchmarks consistently verify the effectiveness of our MMCL, which outperforms existing skeleton-based action recognition methods. Experiments on the UTD-MHAD and SYSU-Action datasets further demonstrate the generalization of our MMCL in zero-shot and domain-adaptive action recognition. Our code is publicly available at https://github.com/liujf69/MMCL-Action.
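To make the contrastive alignment step concrete, the following is a minimal sketch of how paired skeleton and RGB features could be aligned with a symmetric InfoNCE-style loss. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the loss form, so InfoNCE is a swapped-in standard choice, and the function name `info_nce`, the `temperature` value, and the feature shapes are all hypothetical.

```python
import numpy as np


def info_nce(skel_feats, rgb_feats, temperature=0.1):
    """Symmetric InfoNCE loss for N paired (skeleton, RGB) feature vectors.

    Assumption: row i of `skel_feats` and row i of `rgb_feats` come from the
    same sample (a positive pair); all other rows act as negatives.
    """
    # L2-normalise so the dot product below is a cosine similarity.
    skel = skel_feats / np.linalg.norm(skel_feats, axis=1, keepdims=True)
    rgb = rgb_feats / np.linalg.norm(rgb_feats, axis=1, keepdims=True)

    # (N, N) similarity matrix; the diagonal holds the positive pairs.
    logits = skel @ rgb.T / temperature

    def diagonal_cross_entropy(mat):
        # Numerically stable log-softmax over each row.
        shifted = mat - mat.max(axis=1, keepdims=True)
        log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        # The target class for row i is column i.
        return -np.mean(np.diag(log_prob))

    # Average the skeleton-to-RGB and RGB-to-skeleton directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(8, 16))      # hypothetical batch of 8 features
    print(info_nce(feats, feats))         # well-aligned pairs
    print(info_nce(feats, feats[::-1]))   # mismatched pairs
```

In this sketch, features that are already aligned yield a lower loss than mismatched ones, which is the signal a training loop would minimise. Only the skeleton branch would be kept at inference, consistent with the efficiency goal described above.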

Keypoints-based Multimodal Network for Robust Human Action Recognition

Human-centric multimodal fusion network for robust action recognition

Skeleton Sequence and RGB Frame Based Multi-Modality Feature Fusion Network for Action Recognition

Evaluating fusion of RGB-D and inertial sensors for multimodal human action recognition

Multi-view key information representation and multi-modal fusion for single-subject routine action recognition

Skeleton-Indexed Deep Multi-Modal Feature Learning for High Performance Human Action Recognition

Multi-modality Fusion Network for Action Recognition

Skeleton-Based Multifeatures and Multistream Network for Real-Time Action Recognition

A Key Skeleton Points Guided Classroom Action Recognition Method Based on Multimodal Symmetry Fusion

Symmetrical Enhanced Fusion Network for Skeleton-Based Action Recognition

Skeleton Focused Human Activity Recognition in RGB Video

Shifting Perspective to See Difference: A Novel Multi-View Method for Skeleton Based Action Recognition

Modelling Human Body Pose for Action Recognition Using Deep Neural Networks

HDBN: A Novel Hybrid Dual-branch Network for Robust Skeleton-based Action Recognition

Multimodal human action recognition based on spatio-temporal action representation recognition model

Multimodal Fusion via Teacher-Student Network for Indoor Action Recognition

Multi-Modality Co-Learning for Efficient Skeleton-based Action Recognition

Modality Compensation Network: Cross-Modal Adaptation for Action Recognition

Human Action Recognition Using Deep Multilevel Multimodal (M2) Fusion of Depth and Inertial Sensors

A Multi-Task Neural Network for Action Recognition with 3D Key-Points