Abstract: Skeleton-based action recognition has garnered significant attention due to the utilization of concise and resilient skeletons. Nevertheless, the absence of detailed body information in skeletons restricts performance, while other multimodal methods require substantial inference resources and are inefficient when using multimodal data during both training and inference stages. To address this and fully harness the complementary multimodal features, we propose a novel multi-modality co-learning (MMCL) framework by leveraging the multimodal large language models (LLMs) as auxiliary networks for efficient skeleton-based action recognition, which engages in multi-modality co-learning during the training stage and keeps efficiency by employing only concise skeletons in inference. Our MMCL framework primarily consists of two modules. First, the Feature Alignment Module (FAM) extracts rich RGB features from video frames and aligns them with global skeleton features via contrastive learning. Second, the Feature Refinement Module (FRM) uses RGB images with temporal information and text instruction to generate instructive features based on the powerful generalization of multimodal LLMs. These instructive text features will further refine the classification scores and the refined scores will enhance the model's robustness and generalization in a manner similar to soft labels. Extensive experiments on NTU RGB+D, NTU RGB+D 120 and Northwestern-UCLA benchmarks consistently verify the effectiveness of our MMCL, which outperforms the existing skeleton-based action recognition methods. Meanwhile, experiments on UTD-MHAD and SYSU-Action datasets demonstrate the commendable generalization of our MMCL in zero-shot and domain-adaptive action recognition. Our code is publicly available at: https://github.com/liujf69/MMCL-Action.
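The alignment step described for the FAM can be illustrated with a small sketch. It assumes an InfoNCE-style contrastive objective in which each sample's skeleton feature should be closest to its own RGB feature within a batch; the abstract only says "contrastive learning", so the exact loss, the function names (`l2_normalize`, `info_nce`) and the `temperature` value are illustrative assumptions, not the authors' implementation.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def info_nce(skeleton_feats, rgb_feats, temperature=0.1):
    """Mean InfoNCE loss over a batch of paired features.

    Assumption: sample i's skeleton feature is the positive match for
    sample i's RGB feature; all other RGB features act as negatives.
    The paper's actual contrastive loss may differ.
    """
    s = [l2_normalize(v) for v in skeleton_feats]
    r = [l2_normalize(v) for v in rgb_feats]
    total = 0.0
    for i, si in enumerate(s):
        # Cosine similarity to every RGB feature, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(si, rj)) / temperature for rj in r]
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]  # -log softmax of the positive pair
    return total / len(s)

# Toy batch of two samples with 2-D features (made-up numbers).
skeleton = [[1.0, 0.0], [0.0, 1.0]]
aligned_rgb = [[0.9, 0.1], [0.1, 0.9]]   # each RGB feature matches its skeleton
swapped_rgb = [[0.1, 0.9], [0.9, 0.1]]   # pairs deliberately mismatched

loss_aligned = info_nce(skeleton, aligned_rgb)
loss_swapped = info_nce(skeleton, swapped_rgb)
print(loss_aligned < loss_swapped)
```

Minimizing such a loss pulls paired skeleton and RGB features together, which is one way RGB information could shape the skeleton encoder during training without being needed at inference.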

What problem does this paper attempt to address?

The paper addresses the performance limits of skeleton-based action recognition when only the skeleton modality is used for both training and inference. Multimodal methods can improve performance, but they require substantial inference resources and are inefficient. Specifically, the skeleton modality lacks detailed body information (such as appearance and objects), which makes fine-grained recognition of similar actions difficult. To solve these problems, the paper proposes a Multi-Modality Co-Learning (MMCL) framework that enhances the learning of skeleton features by introducing multimodal data during the training phase, while using only the concise skeleton modality during the inference phase to maintain efficiency.

The main contributions of the paper are:

1. Proposing a new Multi-Modality Co-Learning (MMCL) framework that enhances the robustness and generalization ability of mainstream Graph Convolutional Network (GCN) models through multi-modality co-learning during the training phase, while using only the concise skeleton modality during the inference phase to maintain efficiency.
2. Introducing multimodal large language models (LLMs) into multi-modality co-learning for skeleton-based action recognition for the first time. The MMCL framework is orthogonal to the choice of backbone network and can be applied to optimize mainstream GCN models. Owing to the generalization ability of multimodal LLMs, MMCL can also be transferred to domain-adaptive and zero-shot action recognition tasks.
3. Conducting extensive experiments on three popular datasets (NTU RGB+D, NTU RGB+D 120, and Northwestern-UCLA) to verify the effectiveness of the MMCL framework, with performance surpassing existing skeleton-based action recognition methods. Additionally, experiments on the SYSU-Action and UTD-MHAD datasets, which come from different domains, show that MMCL also performs well in domain-adaptive and zero-shot action recognition tasks.

Through these contributions, the proposed method improves the performance and generalization ability of skeleton-based action recognition while keeping inference efficient.
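The split between multimodal training and skeleton-only inference can be sketched as follows. This is a minimal illustration under stated assumptions: `aux_logits` is a hypothetical stand-in for scores derived from the auxiliary RGB/text branch, the refined scores are formed by simply adding the two score vectors, and `soft_weight` is an arbitrary mixing weight. None of these details are confirmed by the summary above; the only points taken from it are that auxiliary information is used during training in a soft-label-like way and that inference needs the skeleton branch alone.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(target_probs, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for t, x in zip(target_probs, logits))

def training_loss(skeleton_logits, aux_logits, label, soft_weight=0.5):
    """Hypothetical training objective: hard-label loss plus a soft-label term.

    Assumption: refined scores = softmax(skeleton_logits + aux_logits),
    treated as a fixed soft target for the skeleton branch.
    """
    one_hot = [1.0 if k == label else 0.0 for k in range(len(skeleton_logits))]
    refined = softmax([s + a for s, a in zip(skeleton_logits, aux_logits)])
    hard = cross_entropy(one_hot, skeleton_logits)
    soft = cross_entropy(refined, skeleton_logits)
    return hard + soft_weight * soft

def predict(skeleton_logits):
    """Inference path: only the skeleton logits are required."""
    return max(range(len(skeleton_logits)), key=lambda k: skeleton_logits[k])

# Toy example with three classes (made-up numbers).
skeleton_logits = [2.0, 1.8, -1.0]   # skeleton branch barely separates classes 0 and 1
aux_logits = [1.5, -0.5, 0.0]        # auxiliary branch favours class 0

loss = training_loss(skeleton_logits, aux_logits, label=0)
prediction = predict(skeleton_logits)
print(prediction)
```

Because `predict` never touches `aux_logits`, the auxiliary branch adds cost only at training time, which is the efficiency argument made in the summary.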

Multi-Modality Co-Learning for Efficient Skeleton-based Action Recognition

Skeleton-Indexed Deep Multi-Modal Feature Learning for High Performance Human Action Recognition

Human-centric multimodal fusion network for robust action recognition

MS²L: Multi-Task Self-Supervised Learning for Skeleton Based Action Recognition

Temporal Cues Enhanced Multimodal Learning for Action Recognition in RGB-D Videos

A Bidirectional Separated Distillation-Based Cross-Modal Interactive Fusion Network for Skeleton-Based Action Recognition

When Skeleton Meets Motion: Adaptive Multimodal Graph Representation Fusion for Action Recognition

Multi-source Learning for Skeleton-Based Action Recognition Using Deep LSTM Networks

Multi-Modality Adaptive Feature Fusion Graph Convolutional Network for Skeleton-Based Action Recognition

Multisource Learning for Skeleton-Based Action Recognition Using Deep LSTM and CNN

Multi-scale motion contrastive learning for self-supervised skeleton-based action recognition

Vision-Language Meets the Skeleton: Progressively Distillation with Cross-Modal Knowledge for 3D Action Representation Learning

Multi-Scale Enhanced Active Learning for Skeleton-Based Action Recognition

Action Recognition Based on 3D Skeleton and RGB Frame Fusion

Modality Compensation Network: Cross-Modal Adaptation for Action Recognition

Skeleton Focused Human Activity Recognition in RGB Video

Skeleton Sequence and RGB Frame Based Multi-Modality Feature Fusion Network for Action Recognition

Elevating Skeleton-Based Action Recognition with Efficient Multi-Modality Self-Supervision

Skeleton-based Action Recognition via Adaptive Cross-Form Learning

Attention-Based Multilevel Co-Occurrence Graph Convolutional LSTM for 3-D Action Recognition

A Key Skeleton Points Guided Classroom Action Recognition Method Based on Multimodal Symmetry Fusion