Abstract: Skeleton-based action representation learning aims to interpret and understand human behaviors by encoding skeleton sequences. Existing approaches fall into two primary training paradigms: supervised learning and self-supervised learning. However, the former relies on one-hot classification, which requires labor-intensive annotation of predefined action categories, while the latter applies skeleton transformations (e.g., cropping) in its pretext tasks that may impair the skeleton structure. To address these challenges, we introduce a novel skeleton-based training framework (C$^2$VL) based on Cross-modal Contrastive learning that uses progressive distillation to learn task-agnostic human skeleton action representations from Vision-Language knowledge prompts. Specifically, we establish a vision-language action concept space through vision-language knowledge prompts generated by pre-trained large multimodal models (LMMs), which supply the fine-grained details that the skeleton action space lacks. Moreover, we propose intra-modal self-similarity and inter-modal cross-consistency softened targets in the cross-modal representation learning process to progressively control how strongly vision-language knowledge prompts and their corresponding skeletons are pulled together. These soft instance discrimination and self-knowledge distillation strategies contribute to learning better skeleton-based action representations from noisy skeleton-vision-language pairs. During the inference phase, our method requires only skeleton data as input for action recognition and no longer needs vision-language prompts. Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that our method outperforms previous methods and achieves state-of-the-art results. Code is available at: https://github.com/cseeyangchen/C2VL.
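To make the idea of softened targets concrete, the following is a minimal sketch of a cross-modal contrastive loss with soft targets. It is not the C$^2$VL implementation. The function names, the equal-weight blend of the intra-modal and inter-modal distributions, the temperatures, and the linear `alpha_schedule` are all assumptions chosen for illustration. In a real training loop the targets would also be treated as constants (no gradient).

```python
import math

def softmax(row, temperature=1.0):
    # Numerically stable softmax over one row of scores.
    scaled = [x / temperature for x in row]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def similarity(a, b):
    # Cosine-similarity matrix between two lists of embeddings.
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]

def alpha_schedule(epoch, total_epochs, alpha_max=0.5):
    # Hypothetical linear ramp: hard targets first, softer targets later.
    return alpha_max * epoch / max(total_epochs - 1, 1)

def soft_targets(skel, prompt, alpha, tau_t=0.1):
    # Blend one-hot targets with softened similarity distributions.
    # alpha = 0 recovers hard instance discrimination.
    intra = [softmax(r, tau_t) for r in similarity(skel, skel)]    # intra-modal self-similarity
    inter = [softmax(r, tau_t) for r in similarity(skel, prompt)]  # inter-modal cross-consistency
    n = len(skel)
    targets = []
    for i in range(n):
        row = []
        for j in range(n):
            hard = 1.0 if i == j else 0.0
            soft = 0.5 * (intra[i][j] + inter[i][j])  # assumed equal weighting
            row.append((1.0 - alpha) * hard + alpha * soft)
        targets.append(row)
    return targets

def soft_contrastive_loss(skel, prompt, alpha, tau=0.07):
    # Cross-entropy between softened targets and skeleton-to-prompt predictions.
    logits = similarity(skel, prompt)
    targets = soft_targets(skel, prompt, alpha)
    loss = 0.0
    for t_row, l_row in zip(targets, logits):
        probs = softmax(l_row, tau)
        loss -= sum(t * math.log(p) for t, p in zip(t_row, probs))
    return loss / len(skel)

if __name__ == "__main__":
    skel = [[1.0, 0.0], [0.0, 1.0]]      # toy skeleton embeddings
    prompt = [[0.9, 0.1], [0.1, 0.9]]    # toy prompt embeddings
    for epoch in range(3):
        a = alpha_schedule(epoch, 3)
        print(epoch, round(a, 2), soft_contrastive_loss(skel, prompt, a))
```

With `alpha = 0` the targets reduce to the identity matrix, so each skeleton is pulled only toward its own prompt. As `alpha` grows, part of the target mass shifts to other samples the model already considers similar, which is one way to soften the loss on noisy pairs.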

Open-Vocabulary Skeleton Action Recognition with Diffusion Graph Convolutional Network and Pre-Trained Vision-Language Models

DD-GCN: Directed Diffusion Graph Convolutional Network for Skeleton-based Human Action Recognition

Zero-Shot Skeleton-based Action Recognition with Dual Visual-Text Alignment

Multidimensional Refinement Graph Convolutional Network With Robust Decouple Loss for Fine-Grained Skeleton-Based Action Recognition

Channel-Wise Dense Connection Graph Convolutional Network for Skeleton-Based Action Recognition

Skeleton-based Action Recognition via Adaptive Cross-Form Learning

Prompt-supervised dynamic attention graph convolutional network for skeleton-based action recognition

Spatio-Temporal Inception Graph Convolutional Networks for Skeleton-Based Action Recognition

Vision-Language Meets the Skeleton: Progressively Distillation with Cross-Modal Knowledge for 3D Action Representation Learning

A Novel Contrastive Diffusion Graph Convolutional Network for Few-Shot Skeleton-Based Action Recognition

Skeleton action recognition via graph convolutional network with self-attention module

Forward-reverse adaptive graph convolutional networks for skeleton-based action recognition

An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition

A Tri-Attention Enhanced Graph Convolutional Network for Skeleton-Based Action Recognition

Feature reconstruction graph convolutional network for skeleton-based action recognition

An improved spatial temporal graph convolutional network for robust skeleton-based action recognition

Skeleton-Based Action Recognition With Low-Level Features of Adaptive Graph Convolutional Networks

Focusing and Diffusion: Bidirectional Attentive Graph Convolutional Networks for Skeleton-based Action Recognition

Attention-Guided and Topology-Enhanced Shift Graph Convolutional Network for Skeleton-Based Action Recognition

Pose-Guided Graph Convolutional Networks for Skeleton-Based Action Recognition