Abstract: Skeleton-based zero-shot action recognition aims to recognize unknown human actions based on the learned priors of the known skeleton-based actions and a semantic descriptor space shared by both known and unknown categories. However, previous works focus on establishing the bridges between the known skeleton representation space and semantic descriptions space at the coarse-grained level for recognizing unknown action categories, ignoring the fine-grained alignment of these two spaces, resulting in suboptimal performance in distinguishing high-similarity action categories. To address these challenges, we propose a novel method via Side information and dual-prompts learning for skeleton-based zero-shot action recognition (STAR) at the fine-grained level. Specifically, 1) we decompose the skeleton into several parts based on its topology structure and introduce the side information concerning multi-part descriptions of human body movements for alignment between the skeleton and the semantic space at the fine-grained level; 2) we design the visual-attribute and semantic-part prompts to improve the intra-class compactness within the skeleton space and inter-class separability within the semantic space, respectively, to distinguish the high-similarity actions. Extensive experiments show that our method achieves state-of-the-art performance in ZSL and GZSL settings on NTU RGB+D, NTU RGB+D 120, and PKU-MMD datasets.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses several key challenges in skeleton-based zero-shot action recognition, in both the zero-shot learning (ZSL) and generalized zero-shot learning (GZSL) settings. Specifically, existing methods have the following problems when dealing with unknown action categories:

1. **Insufficient fine-grained alignment**: Existing methods mainly build a bridge between the known skeleton representation space and the semantic description space at a coarse-grained level. They ignore the fine-grained alignment between these two spaces, which leads to poor performance in distinguishing highly similar action categories.
2. **Poor intra-class compactness and inter-class separability**: The pre-extracted skeleton representations and semantic embeddings have low intra-class compactness and inter-class separability in their respective feature spaces, making it difficult to distinguish highly similar actions.
3. **Limited generalization ability**: In GZSL, previous studies classify unknown categories through the probability distribution of known categories. The paper regards this approach as unreasonable and argues that it limits the generalization ability of the model.

To address these challenges, the paper proposes a new method, Side Information and Dual-Prompts Learning for Zero-Shot Skeleton Action Recognition (STAR), to achieve fine-grained alignment and better action recognition performance.

### Solutions

1. **Fine-grained skeleton decomposition**: The skeleton sequence is decomposed into multiple parts according to the human body topology, and the motion description of each part is introduced as side information, thereby aligning the skeleton and semantic spaces at a fine-grained level.
2. **Visual attribute prompts and semantic part prompts**:
   - **Visual attribute prompts**: These prompts explore the spatio-temporal features of skeleton parts through a cross-attention mechanism and improve intra-class compactness.
   - **Semantic part prompts**: These prompts further improve the inter-class separability of the side information in the semantic space.
3. **Multi-loss optimization**: Training is guided by a multi-part cross-entropy loss, a semantic cross-entropy loss, and a global cross-entropy loss, so that the skeleton and semantic spaces are aligned at both the fine-grained and the global level.

### Experimental results

The experimental results show that STAR achieves state-of-the-art performance in both ZSL and GZSL settings on the NTU RGB+D, NTU RGB+D 120, and PKU-MMD datasets. In cross-subject tasks in particular, STAR shows significant performance improvement under different known/unknown category division strategies.

### Main contributions

1. **Introduction of side information**: Motion descriptions of skeleton parts are introduced as side information, which enriches the spatio-temporal information carried by category names and enables fine-grained alignment.
2. **Design of dual prompts**: Visual attribute prompts and semantic part prompts are proposed. They improve intra-class compactness and inter-class separability among action categories, respectively, which helps in identifying highly similar actions.
3. **Extensive experimental verification**: Extensive experiments on multiple datasets demonstrate the superior performance of the proposed method in ZSL and GZSL settings.
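The part-level alignment and the cross-entropy objectives summarized above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the joint-to-part grouping `PARTS`, the mean pooling of joints, the cosine similarity, the `temperature` value, and all function names are hypothetical choices made here for clarity.

```python
import math

# Hypothetical grouping of joint indices into body parts (illustrative only;
# not the partition used in the paper).
PARTS = {
    "head": [0, 1],
    "torso": [2, 3],
    "arms": [4, 5, 6, 7],
    "legs": [8, 9, 10, 11],
}


def part_features(joint_feats, parts=PARTS):
    """Average the per-joint feature vectors inside each body part."""
    dim = len(joint_feats[0])
    return {
        name: [sum(joint_feats[j][d] for j in idxs) / len(idxs) for d in range(dim)]
        for name, idxs in parts.items()
    }


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def class_scores(sk_parts, class_part_embs):
    """Score each class by the mean part-wise similarity between skeleton-part
    features and that class's part-description embeddings."""
    scores = {}
    for cls, embs in class_part_embs.items():
        sims = [cosine(sk_parts[p], embs[p]) for p in sk_parts]
        scores[cls] = sum(sims) / len(sims)
    return scores


def cross_entropy(scores, target, temperature=0.1):
    """Softmax cross-entropy over similarity scores (assumed temperature)."""
    logits = {c: s / temperature for c, s in scores.items()}
    m = max(logits.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return log_z - logits[target]
```

At inference time, a zero-shot prediction in this sketch would simply be `max(scores, key=scores.get)` over the part-description embeddings of the unseen classes. Averaging over parts is only one possible way to combine part-level similarities.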
### Formula summary

- **Multi-head cross-attention mechanism**:

  \[ \text{Att}_v^h=\text{softmax}\left(\frac{Q_vK_v^T}{\sqrt{d}}\right)V_v \]

  \[ \text{ffe}_v = \text{concat}(\text{Att}_v^1,\ldots,\text{Att}_v^n)W_o \]

- **Multi-part cross-entropy loss**:

  \[ L_{\text{MPCE}}=-\frac{1}{B\times K}\sum_{i = 1}^{B}\sum_{e = 1}^{K}\log\left(\frac{\exp(F_{i,e}^v\cdot F_e^{\text{
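The multi-head cross-attention formulas in this summary, softmax(Q K^T / sqrt(d)) V per head followed by concatenation and an output projection W_o, can be illustrated with a small pure-Python sketch. The way heads are formed here (slicing the feature dimension into equal chunks), the function names, and the choice of which inputs serve as queries versus keys and values are assumptions made for illustration; the source does not specify them.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Single head: softmax(Q K^T / sqrt(d)) V."""
    d = len(q[0])
    scores = [[sum(x * y for x, y in zip(qr, kr)) / math.sqrt(d) for kr in k] for qr in q]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)


def split_heads(x, n_heads):
    """Slice the feature dimension into n_heads equal chunks (an assumption)."""
    d = len(x[0]) // n_heads
    return [[row[h * d:(h + 1) * d] for row in x] for h in range(n_heads)]


def multi_head_cross_attention(q, k, v, w_o, n_heads):
    """Run attention per head, concatenate the heads, then project with W_o."""
    heads = [
        attention(qh, kh, vh)
        for qh, kh, vh in zip(
            split_heads(q, n_heads), split_heads(k, n_heads), split_heads(v, n_heads)
        )
    ]
    concat = [sum((head[i] for head in heads), []) for i in range(len(q))]
    return matmul(concat, w_o)
```

In a cross-attention setting the queries come from one source (for example, learnable prompts) while the keys and values come from another (for example, skeleton-part features), so `q` may have a different number of rows than `k` and `v`; the output has one row per query.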

Fine-Grained Side Information Guided Dual-Prompts for Zero-Shot Skeleton Action Recognition
