Abstract: Skeleton-based action representation learning aims to interpret and understand human behaviors by encoding skeleton sequences. Existing approaches fall into two primary training paradigms: supervised learning and self-supervised learning. However, the former relies on one-hot classification, which requires labor-intensive annotation of predefined action categories, while the latter involves skeleton transformations (e.g., cropping) in the pretext tasks that may impair the skeleton structure. To address these challenges, we introduce a novel skeleton-based training framework (C$^2$VL) based on Cross-modal Contrastive learning that uses progressive distillation to learn task-agnostic human skeleton action representations from Vision-Language knowledge prompts. Specifically, we establish a vision-language action concept space through vision-language knowledge prompts generated by pre-trained large multimodal models (LMMs), which supply the fine-grained details that the skeleton action space lacks. Moreover, we propose intra-modal self-similarity and inter-modal cross-consistency softened targets in the cross-modal representation learning process to progressively control and guide how closely vision-language knowledge prompts and their corresponding skeletons are pulled together. These soft instance discrimination and self-knowledge distillation strategies contribute to learning better skeleton-based action representations from noisy skeleton-vision-language pairs. During the inference phase, our method requires only skeleton data as input for action recognition and no longer needs vision-language prompts. Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that our method outperforms previous methods and achieves state-of-the-art results. Code is available at: https://github.com/cseeyangchen/C2VL.
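The softened-target idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the mixing weight `alpha`, the temperature `tau`, and the specific choice of blending one-hot targets with the prompt self-similarity distribution are all assumptions made for illustration only. The sketch shows how a contrastive loss between skeleton and prompt embeddings might use soft targets instead of strict one-hot instance labels.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(z):
    # Numerically stable row-wise log-softmax.
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def soft_target_contrastive_loss(skel, prompt, alpha, tau=0.1):
    """Hypothetical soft-target contrastive loss (illustrative only).

    skel, prompt: (N, D) arrays of paired embeddings.
    alpha: assumed mixing weight in [0, 1]; 0 recovers one-hot targets.
    tau: assumed softmax temperature.
    """
    s = l2_normalize(skel)
    p = l2_normalize(prompt)
    n = s.shape[0]
    # Inter-modal logits: skeleton i against every prompt j.
    logits = s @ p.T / tau
    # Intra-modal self-similarity of the prompts, as a row distribution.
    self_sim = np.exp(log_softmax(p @ p.T / tau))
    # Softened targets: blend of one-hot labels and self-similarity.
    targets = (1.0 - alpha) * np.eye(n) + alpha * self_sim
    # Cross-entropy between softened targets and predicted distribution.
    loss = -(targets * log_softmax(logits)).sum(axis=1).mean()
    return loss, targets
```

Under these assumptions, a training loop could raise `alpha` gradually over epochs so that early training relies on the paired one-hot labels and later training leans more on the similarity structure; the schedule itself is likewise a hypothetical choice, not something the abstract specifies.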

I$^2$MD: 3D Action Representation Learning with Inter- and Intra-modal Mutual Distillation

CMD: Self-supervised 3D Action Representation Learning with Cross-Modal Mutual Distillation

Cross-modality Online Distillation for Multi-View Action Recognition

Learning and Distillating the Internal Relationship of Motion Features in Action Recognition

Vision-Language Meets the Skeleton: Progressively Distillation with Cross-Modal Knowledge for 3D Action Representation Learning

Is Contrastive Distillation Enough for Learning Comprehensive 3D Representations?

Modality Distillation with Multiple Stream Networks for Action Recognition

Mutual Information Driven Equivariant Contrastive Learning for 3D Action Representation Learning

Multi-modal Relation Distillation for Unified 3D Representation Learning

Farewell to Mutual Information: Variational Distillation for Cross-Modal Person Re-Identification

Multi-view Distillation based on Multi-modal Fusion for Few-shot Action Recognition (CLIP-$\mathrm{M^2}$DF)

Module-wise Adaptive Distillation for Multimodality Foundation Models

Prompted Contrast with Masked Motion Modeling: Towards Versatile 3D Action Representation Learning

Incomplete Multimodal Industrial Anomaly Detection via Cross-Modal Distillation

DMCL: Distillation Multiple Choice Learning for Multimodal Action Recognition

Unlock the Power: Competitive Distillation for Multi-Modal Large Language Models

Improving Self-Supervised Action Recognition from Extremely Augmented Skeleton Sequences

SMTDKD: A Semantic-Aware Multimodal Transformer Fusion Decoupled Knowledge Distillation Method for Action Recognition

Multimodal Molecular Pretraining via Modality Blending

Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning

One-stage Modality Distillation for Incomplete Multimodal Learning