Part-aware Unified Representation of Language and Skeleton for Zero-shot Action Recognition

Anqi Zhu, Qiuhong Ke, Mingming Gong, James Bailey
2024-06-19
Abstract: While remarkable progress has been made on supervised skeleton-based action recognition, the challenge of zero-shot recognition remains relatively unexplored. In this paper, we argue that relying solely on aligning label-level semantics and global skeleton features is insufficient to effectively transfer locally consistent visual knowledge from seen to unseen classes. To address this limitation, we introduce Part-aware Unified Representation between Language and Skeleton (PURLS) to explore visual-semantic alignment at both local and global scales. PURLS introduces a new prompting module and a novel partitioning module to generate aligned textual and visual representations across different levels. The former leverages a pre-trained GPT-3 to infer refined descriptions of the global and local (body-part-based and temporal-interval-based) movements from the original action labels. The latter employs an adaptive sampling strategy to group visual features from all body joint movements that are semantically relevant to a given description. Our approach is evaluated on various skeleton/language backbones and three large-scale datasets, i.e., NTU-RGB+D 60, NTU-RGB+D 120, and a newly curated dataset Kinetics-skeleton 200. The results showcase the universality and superior performance of PURLS, surpassing prior skeleton-based solutions and standard baselines from other domains. The source code can be accessed at https://github.com/azzh1/PURLS.
Computer Vision and Pattern Recognition
### What problem does this paper attempt to solve?

This paper addresses **zero-shot action recognition (ZSR)**, specifically action recognition from skeleton sequences. Existing methods mainly align global skeleton features with label-level semantics, which limits their effectiveness on unseen action categories. The authors argue that this approach cannot effectively transfer locally consistent visual knowledge from seen categories to unseen categories.

To solve this problem, the authors propose a new framework named **Part-aware Unified Representation of Language and Skeleton (PURLS)**. PURLS achieves cross-modal semantic alignment of language and skeleton features at both global and local scales by introducing the **Local-awareness Module** and the **Adaptive Partitioning Module**:

1. **Local-awareness Module**: Uses a pre-trained language model (such as GPT-3) to generate detailed descriptions, including a global action description and local action descriptions based on body parts and time intervals.
2. **Adaptive Partitioning Module**: Groups skeleton joint features through an adaptive sampling strategy to generate visual representations that are semantically related to the given descriptions.

In this way, PURLS can better capture and transfer knowledge of local visual concepts, thereby improving recognition performance on unseen categories.

### Main contributions of the paper

1. **Propose the PURLS framework**: Explores and aligns global and local visual concepts with rich semantic information for zero-shot action recognition.
2. **Adaptive weight learning**: Through the adaptive partitioning module, supports local knowledge transfer from seen categories to unseen categories.
3. **Extensive experimental verification**: Achieves state-of-the-art performance on three large-scale datasets (NTU-RGB+D 60, NTU-RGB+D 120, and Kinetics-skeleton 200), demonstrating its robustness and generalization ability.

### Formula summary

- **Text embedding calculation**:
  \[ F = f_{\text{text}}(D_y) \in \mathbb{R}^{(P+Z+1)\times m} \]
  where \(D_y\) is the set of descriptions generated for label \(y\) (one global description, \(P\) body-part descriptions, and \(Z\) temporal-interval descriptions), \(P\) is the number of body parts, \(Z\) is the number of time intervals, and \(m\) is the text embedding dimension.
- **Attention matrix calculation**:
  \[ A = \text{softmax}\left(\frac{QK^{T}}{\sqrt{h}}\right) \]
  where \(Q = FW_Q\), \(K = GW_K\), \(W_Q\in\mathbb{R}^{m\times h}\), \(W_K\in\mathbb{R}^{n\times h}\), \(G\) is the matrix of skeleton joint features with feature dimension \(n\), and \(h\) is the projection dimension.
- **Visual representation calculation**:
  \[ R = AG \]
- **Contrastive loss function**:
  \[ L(V_i,F_i) = -\frac{1}{2}\log\frac{\exp\left(V_iF_i/\tau\right)}{\sum_{o\in Y_{sc}}\exp\left(V_iF_o/\tau\right)} - \frac{1}{2}\log\frac{\exp\left(V_iF_i/\tau\right)}{\sum_{w\in\text{batch}}\exp\left(V_wF_i/\tau\right)} \]
- **Total training loss**:
  \[ L_{\text{train}}(x,y) = \sum_{i=0}^{P+Z}\alpha_i L(V_i,F_i) \]

Together, these formulas describe how PURLS aligns textual and visual representations at the global and local levels during training.
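The formulas above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`partition`, `level_loss`, `total_loss`) are hypothetical, \(Y_{sc}\) is read as the set of seen classes, \(\tau\) is treated as a temperature, and each \(V_i\) is assumed to be already projected to the text embedding dimension \(m\) (the summary does not state how \(R\) is mapped to \(V\)). Standard-library lists are used instead of a tensor library to keep the sketch self-contained.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def log_softmax_at(logits, idx):
    """log(exp(logits[idx]) / sum_j exp(logits[j])), computed stably."""
    peak = max(logits)
    lse = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return logits[idx] - lse

def partition(F, G, W_Q, W_K):
    """A = softmax(Q K^T / sqrt(h)) and R = A G, with Q = F W_Q, K = G W_K."""
    Q = matmul(F, W_Q)                       # (P+Z+1) x h
    K = matmul(G, W_K)                       # (number of joint features) x h
    h = len(W_Q[0])
    scores = matmul(Q, transpose(K))         # one row of scores per description
    A = [softmax([s / math.sqrt(h) for s in row]) for row in scores]
    R = matmul(A, G)                         # (P+Z+1) x n
    return A, R

def level_loss(v_batch, f_classes, labels, k, tau):
    """Symmetric contrastive loss L(V_i, F_i) for batch sample k at one level.

    v_batch:   visual vectors of all batch samples at this level (assumed dim m)
    f_classes: text embeddings of all seen classes at this level
    labels:    class index of each batch sample
    """
    v, y = v_batch[k], labels[k]
    # First term: this sample against the text embeddings of every seen class.
    visual_to_text = [dot(v, f) / tau for f in f_classes]
    # Second term: the true class text against every sample in the batch.
    text_to_visual = [dot(w, f_classes[y]) / tau for w in v_batch]
    return (-0.5 * log_softmax_at(visual_to_text, y)
            - 0.5 * log_softmax_at(text_to_visual, k))

def total_loss(levels_v, levels_f, labels, k, alphas, tau=0.1):
    """Weighted sum of the per-level losses over the P+Z+1 levels."""
    return sum(a * level_loss(v, f, labels, k, tau)
               for a, v, f in zip(alphas, levels_v, levels_f))
```

Each row of `A` is a distribution over joint features, so each row of `R` is a weighted average of the rows of `G` selected for one description. In `total_loss`, `levels_v[i]` and `levels_f[i]` hold the batch visual vectors and class text embeddings for level `i`, and `alphas[i]` plays the role of \(\alpha_i\).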