Abstract: Skeleton-based action recognition has garnered significant attention due to the utilization of concise and resilient skeletons. Nevertheless, the absence of detailed body information in skeletons restricts performance, while other multimodal methods require substantial inference resources and are inefficient because they use multimodal data during both the training and inference stages. To address this and fully harness the complementary multimodal features, we propose a novel multi-modality co-learning (MMCL) framework that leverages multimodal large language models (LLMs) as auxiliary networks for efficient skeleton-based action recognition. The framework engages in multi-modality co-learning during the training stage and stays efficient by employing only concise skeletons at inference. Our MMCL framework primarily consists of two modules. First, the Feature Alignment Module (FAM) extracts rich RGB features from video frames and aligns them with global skeleton features via contrastive learning. Second, the Feature Refinement Module (FRM) uses RGB images with temporal information and text instructions to generate instructive features based on the powerful generalization of multimodal LLMs. These instructive text features further refine the classification scores, and the refined scores enhance the model's robustness and generalization in a manner similar to soft labels. Extensive experiments on the NTU RGB+D, NTU RGB+D 120 and Northwestern-UCLA benchmarks consistently verify the effectiveness of our MMCL, which outperforms existing skeleton-based action recognition methods. Meanwhile, experiments on the UTD-MHAD and SYSU-Action datasets demonstrate the commendable generalization of our MMCL in zero-shot and domain-adaptive action recognition. Our code is publicly available at: https://github.com/liujf69/MMCL-Action.
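The abstract states that the FAM aligns RGB features with global skeleton features via contrastive learning but does not give the loss. As a rough illustration only, the sketch below uses a symmetric InfoNCE-style loss, which is one common choice for this kind of cross-modal alignment; it is an assumption, not necessarily the loss used in MMCL. The function names, the `temperature` value, and the toy feature vectors are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes v is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, candidates, temperature=0.1):
    """One-directional InfoNCE loss.

    anchors[i] and candidates[i] are treated as a positive pair; every
    other candidate in the batch acts as a negative for anchors[i].
    """
    a = [l2_normalize(v) for v in anchors]
    c = [l2_normalize(v) for v in candidates]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarities between this anchor and all candidates.
        logits = [sum(x * y for x, y in zip(ai, cj)) / temperature for cj in c]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching index as the target.
        total += log_denominator - logits[i]
    return total / len(a)


def symmetric_alignment_loss(skeleton_feats, rgb_feats, temperature=0.1):
    """Average of the skeleton-to-RGB and RGB-to-skeleton InfoNCE losses."""
    return 0.5 * (
        info_nce(skeleton_feats, rgb_feats, temperature)
        + info_nce(rgb_feats, skeleton_feats, temperature)
    )


# Toy batch of two samples with 2-D features (hypothetical values).
skeleton = [[1.0, 0.0], [0.0, 1.0]]
rgb_aligned = [[2.0, 0.1], [0.1, 2.0]]   # each row points the same way as its skeleton row
rgb_swapped = [[0.1, 2.0], [2.0, 0.1]]   # rows paired with the wrong skeleton sample

loss_aligned = symmetric_alignment_loss(skeleton, rgb_aligned)
loss_swapped = symmetric_alignment_loss(skeleton, rgb_swapped)
```

In this toy batch, the correctly paired features yield a much smaller loss than the swapped ones, which is the behavior a contrastive alignment objective is meant to reward. Because such a loss is only needed during training, a setup like this is compatible with the abstract's claim that inference uses skeletons alone.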

Multi-source Learning for Skeleton-Based Action Recognition Using Deep LSTM Networks

Multisource Learning for Skeleton-Based Action Recognition Using Deep LSTM and CNN

Skeleton Feature Fusion Based on Multi-Stream LSTM for Action Recognition.

Learning Local Part Motion Representation for Skeleton-based Action Recognition

Skeleton-based Action Recognition Using LSTM and CNN

Fusing Geometric Features for Skeleton-Based Action Recognition Using Multilayer LSTM Networks

Exploring a Rich Spatial-Temporal Dependent Relational Model for Skeleton-Based Action Recognition by Bidirectional LSTM-CNN.

Skeleton-Indexed Deep Multi-Modal Feature Learning for High Performance Human Action Recognition

3D Action Recognition Using Multi-Temporal Skeleton Visualization.

An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition

Skeleton-based Attention-Aware Spatial-Temporal Model for Action Detection and Recognition.

Action Recognition Based on 3D Skeleton and RGB Frame Fusion

Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep cnn

Action Recognition Scheme Based on Skeleton Representation with DS-LSTM Network.

A New Representation of Skeleton Sequences for 3D Action Recognition

Learning to Recognize 3D Human Action from A New Skeleton-based Representation Using Deep Convolutional Neural Networks

Skeleton Sequence and RGB Frame Based Multi-Modality Feature Fusion Network for Action Recognition

Temporal Cues Enhanced Multimodal Learning for Action Recognition in RGB-D Videos

Multi-Modality Co-Learning for Efficient Skeleton-based Action Recognition

Skeleton-Based Human Action Recognition Using Spatial Temporal 3D Convolutional Neural Networks

Attention-Based Multiview Re-Observation Fusion Network for Skeletal Action Recognition.