Abstract: While remarkable progress has been made on supervised skeleton-based action recognition, the challenge of zero-shot recognition remains relatively unexplored. In this paper, we argue that relying solely on aligning label-level semantics and global skeleton features is insufficient to effectively transfer locally consistent visual knowledge from seen to unseen classes. To address this limitation, we introduce Part-aware Unified Representation between Language and Skeleton (PURLS) to explore visual-semantic alignment at both local and global scales. PURLS introduces a new prompting module and a novel partitioning module to generate aligned textual and visual representations across different levels. The former leverages a pre-trained GPT-3 to infer refined descriptions of the global and local (body-part-based and temporal-interval-based) movements from the original action labels. The latter employs an adaptive sampling strategy to group visual features from all body joint movements that are semantically relevant to a given description. Our approach is evaluated on various skeleton/language backbones and three large-scale datasets, i.e., NTU-RGB+D 60, NTU-RGB+D 120, and a newly curated dataset Kinetics-skeleton 200. The results showcase the universality and superior performance of PURLS, surpassing prior skeleton-based solutions and standard baselines from other domains. The source code is available at https://github.com/azzh1/PURLS.


### What problem does this paper attempt to solve?

This paper addresses **zero-shot action recognition (ZSR)** on skeleton sequences. Existing methods mainly align global skeleton features with label-level semantics, which limits how well they recognize unseen action categories. The authors point out that this alignment cannot effectively transfer locally consistent visual knowledge from seen categories to unseen categories.

To solve this problem, the authors propose a framework named **Part-aware Unified Representation of Language and Skeleton (PURLS)**. PURLS aligns language and skeleton features across modalities at both global and local scales by introducing two modules:

1. **Local-awareness Module**: Uses a pre-trained language model (such as GPT-3) to generate detailed descriptions, including a global action description and local action descriptions based on body parts and time intervals.
2. **Adaptive Partitioning Module**: Groups skeleton joint features through an adaptive sampling strategy to generate visual representations that are semantically related to the given descriptions.

In this way, PURLS can better capture and transfer knowledge of local visual concepts, improving recognition performance on unseen categories.

### Main contributions of the paper

1. **The PURLS framework**: Explores and aligns global and local visual concepts and combines them with rich semantic information for zero-shot action recognition.
2. **Adaptive weight learning**: The adaptive partitioning module supports local knowledge transfer from seen categories to unseen categories.
3. **Extensive experimental verification**: Achieves state-of-the-art performance on three large-scale datasets (NTU-RGB+D 60, NTU-RGB+D 120, and Kinetics-skeleton 200), demonstrating its robustness and generalization ability.

### Formula summary

- **Text embedding calculation**:
  \[ F = f_{D_y} = f_{\text{text}}(d) \in \mathbb{R}^{(P + Z + 1) \times m} \]
  where \(d\) is the set of \(P + Z + 1\) descriptions (one global, \(P\) body-part-based, and \(Z\) temporal-interval-based), \(P\) is the number of body parts, \(Z\) is the number of time intervals, and \(m\) is the text embedding dimension.
- **Attention matrix calculation**:
  \[ A = \text{softmax}\left(\frac{Q K^{T}}{\sqrt{h}}\right) \]
  where \(Q = F W_Q\), \(K = G W_K\), \(W_Q \in \mathbb{R}^{m \times h}\), \(W_K \in \mathbb{R}^{n \times h}\), \(G\) is the matrix of skeleton features with feature dimension \(n\), and \(h\) is the projection dimension.
- **Visual representation calculation**:
  \[ R = A G \]
- **Contrastive loss function**:
  \[ L(V_i, F_i) = -\frac{1}{2} \log \frac{\exp\left(\frac{V_i F_i}{\tau}\right)}{\sum_{o \in Y_{sc}} \exp\left(\frac{V_i F_o}{\tau}\right)} - \frac{1}{2} \log \frac{\exp\left(\frac{V_i F_i}{\tau}\right)}{\sum_{w \in \text{batch}} \exp\left(\frac{V_w F_i}{\tau}\right)} \]
- **Total training loss**:
  \[ L_{\text{train}}(x, y) = \sum_{i = 0}^{P + Z} \alpha_i L(V_i, F_i) \]

Together, these formulas define how PURLS is optimized during training.
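The formulas above can be sketched in plain Python to make the shapes and the two halves of the loss concrete. This is a minimal illustration under stated assumptions, not the authors' implementation. The helper names (`partition`, `contrastive_loss`, `total_loss`) are hypothetical. The softmax is assumed to normalize each description's scores over the skeleton features (the rows of `G`). `V` and `F` are assumed to already share an embedding dimension so that their dot product is defined. `Y_sc` is read as the set of candidate classes, and `tau` as a temperature.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(a):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in a:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def partition(F, G, W_Q, W_K):
    """Compute A = softmax(Q K^T / sqrt(h)) and R = A G.

    F: (P+Z+1) x m text embeddings; G: N x n skeleton features (assumed layout);
    W_Q: m x h; W_K: n x h. Returns A: (P+Z+1) x N and R: (P+Z+1) x n.
    """
    Q = matmul(F, W_Q)
    K = matmul(G, W_K)
    h = len(W_Q[0])
    scores = [[v / math.sqrt(h) for v in row] for row in matmul(Q, transpose(K))]
    A = softmax_rows(scores)  # assumption: normalize over skeleton features
    return A, matmul(A, G)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_sum_exp(values):
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def contrastive_loss(v, class_texts, true_class, batch_visuals, tau=0.1):
    """Two-term loss for one sample at one level.

    First term contrasts v against every class text embedding (sum over Y_sc);
    second term contrasts the true text embedding against every visual
    vector in the batch (which should include v itself).
    """
    f_true = class_texts[true_class]
    pos = dot(v, f_true) / tau
    over_classes = log_sum_exp([dot(v, f) / tau for f in class_texts])
    over_batch = log_sum_exp([dot(w, f_true) / tau for w in batch_visuals])
    return -0.5 * (pos - over_classes) - 0.5 * (pos - over_batch)

def total_loss(level_losses, alphas):
    """Weighted sum over the P+Z+1 levels: sum_i alpha_i * L(V_i, F_i)."""
    return sum(a * l for a, l in zip(alphas, level_losses))
```

Each row of `A` is a distribution over skeleton features for one description, so `R` holds one pooled visual vector per global, body-part, or temporal description. In practice these operations would run as batched tensor operations in a deep learning framework; the list-based version only mirrors the algebra.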

Part-aware Unified Representation of Language and Skeleton for Zero-shot Action Recognition
