Abstract: Zero-shot skeleton-based action recognition aims to recognize actions of unseen categories after training on data of seen categories. The key is to build the connection between visual and semantic space from seen to unseen classes. Previous studies have primarily focused on encoding sequences into a single feature vector and then mapping the features to an identical anchor point within the embedding space. Their performance is hindered by 1) neglect of global visual/semantic distribution alignment, which limits their ability to capture the true interdependence between the two spaces; and 2) neglect of temporal information, since the frame-wise features with rich action clues are directly pooled into a single feature vector. We propose a new zero-shot skeleton-based action recognition method via mutual information (MI) estimation and maximization. Specifically, 1) we maximize the MI between visual and semantic space for distribution alignment; 2) we leverage the temporal information for estimating the MI by encouraging MI to increase as more frames are observed. Extensive experiments on three large-scale skeleton action datasets confirm the effectiveness of our method. Code: https://github.com/YujieOuO/SMIE

What problem does this paper attempt to address?

The paper addresses zero-shot skeleton-based action recognition: only data from seen classes is used during training, yet the model must recognize actions from unseen classes at test time. Specifically, the paper aims to establish a connection between the visual space and the semantic space, so that knowledge transfers from seen classes to unseen classes.

### Main Problems and Challenges

1. **Neglect of Global Distribution Alignment**: Previous methods mainly encode an action sequence into a single feature vector and map it to the same anchor point in the embedding space. This ignores the global distribution alignment between the visual and semantic spaces, making it difficult to capture the true interdependence between the two.
2. **Neglect of Temporal Information**: Frame-level features contain rich action cues, but they are directly pooled into a single feature vector. The temporal information is discarded, which loses information useful for action recognition.

### Solutions

To solve the above problems, the paper proposes a new method (SMIE) based on mutual information estimation and maximization, which consists of two modules:

1. **Global Alignment Module**:
   - Align the distributions of the visual and semantic spaces by maximizing the mutual information \( I(V; A) \).
   - Use Jensen-Shannon divergence (JSD) as an estimator to maximize the mutual information between paired visual and semantic features while minimizing the mutual information between unpaired features.
   - The mutual information is defined as:
     \[ I(V; A) = D_{KL}\big(p(v, a) \,\|\, p(v)p(a)\big) = E_{p(v, a)} \left[ \log \frac{p(v|a)}{p(v)} \right] \]
   - Here \( D_{KL} \) is the Kullback-Leibler divergence, \( p(v, a) \) is the joint distribution, and \( p(v)p(a) \) is the product of the marginal distributions.
2. **Temporal Constraint Module**:
   - Use temporal information when estimating the mutual information, and encourage the mutual information to increase gradually as more frames are observed.
   - Propose a bidirectional motion attention mechanism to emphasize key frames. The bidirectional motion of frame \( k \), joint \( j \), and channel \( c \) is:
     \[ p_{k,j,c} = (p_{k,j,c}^{\text{next}})^2 + (p_{k,j,c}^{\text{pre}})^2 \]
   - Average over the \( J \) joints and \( C \) channels to obtain the motion value of each frame:
     \[ p_k = \frac{1}{J \times C} \sum_{j = 1}^{J} \sum_{c = 1}^{C} p_{k,j,c} \]
   - Normalize over the \( K \) frames to obtain the overall motion rate, which serves as the bidirectional attention weight:
     \[ q_k = \frac{p_k}{\sum_{i = 1}^{K} p_i} \]
   - Select the top \( P \) frames with the highest attention scores as key frames and construct an attention-masked sample sequence.

### Summary

Through these two modules, SMIE both aligns the global distributions of the visual and semantic spaces and exploits the temporal dynamics of actions, thereby improving the performance of zero-shot skeleton-based action recognition.
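The two modules described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation (see the linked repository for that). Several details are assumptions made only for illustration: the function names are hypothetical, the JSD estimator uses the softplus form common in the mutual-information-maximization literature, the scores passed to it stand in for the output of some learned critic, and \( p^{\text{next}} \) and \( p^{\text{pre}} \) are taken to be differences to the following and preceding frame, set to zero at the sequence boundaries.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def jsd_mi_estimate(paired_scores, unpaired_scores):
    """JSD-style MI estimate from critic scores (assumed softplus form).

    paired_scores:   scores T(v, a) for matching visual/semantic pairs.
    unpaired_scores: scores T(v, a') for mismatched pairs.
    Higher paired scores and lower unpaired scores raise the estimate.
    """
    pos = sum(-softplus(-s) for s in paired_scores) / len(paired_scores)
    neg = sum(softplus(s) for s in unpaired_scores) / len(unpaired_scores)
    return pos - neg


def motion_attention(seq):
    """Return attention weights q_k for seq[k][j][c] (frame, joint, channel).

    Assumption: p_next and p_pre are differences to the next and previous
    frame, taken as zero where no such frame exists.
    """
    K, J, C = len(seq), len(seq[0]), len(seq[0][0])
    p = []
    for k in range(K):
        total = 0.0
        for j in range(J):
            for c in range(C):
                nxt = seq[k + 1][j][c] - seq[k][j][c] if k + 1 < K else 0.0
                pre = seq[k][j][c] - seq[k - 1][j][c] if k > 0 else 0.0
                total += nxt ** 2 + pre ** 2  # p_{k,j,c}
        p.append(total / (J * C))  # p_k: mean over joints and channels
    norm = sum(p)
    if norm == 0.0:
        return [1.0 / K] * K  # static sequence: fall back to uniform weights
    return [pk / norm for pk in p]  # q_k


def top_key_frames(q, P):
    """Indices of the P frames with the largest weights, in temporal order."""
    ranked = sorted(range(len(q)), key=lambda k: q[k], reverse=True)
    return sorted(ranked[:P])


# Toy sequence: 4 frames, 1 joint, 1 channel; all motion lies between frames 1 and 2.
toy = [[[0.0]], [[0.0]], [[3.0]], [[3.0]]]
weights = motion_attention(toy)
key_frames = top_key_frames(weights, P=2)
```

In this toy sequence only frames 1 and 2 border the jump, so they receive all of the attention weight and are selected as key frames. The MI estimate, in turn, grows as the critic separates paired from unpaired scores more cleanly, which is the quantity the global alignment module is described as maximizing.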

Zero-shot Skeleton-based Action Recognition via Mutual Information Estimation and Maximization

MKTZ: multi-semantic embedding and key frame masking techniques for zero-shot skeleton action recognition

An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition

Zero-Shot Skeleton-based Action Recognition with Dual Visual-Text Alignment

Shifting Perspective to See Difference: A Novel Multi-View Method for Skeleton Based Action Recognition

Action Recognition Based on Global Optimal Similarity Measuring

One-Shot Action Recognition via Multi-Scale Spatial-Temporal Skeleton Matching

Fine-Grained Side Information Guided Dual-Prompts for Zero-Shot Skeleton Action Recognition

Multi-Semantic Fusion Model for Generalized Zero-Shot Skeleton-Based Action Recognition

SA-DVAE: Improving Zero-Shot Skeleton-Based Action Recognition by Disentangled Variational Autoencoders

Zero-shot Action Recognition Via Empirical Maximum Mean Discrepancy

Prompt-Guided Zero-Shot Anomaly Action Recognition using Pretrained Deep Skeleton Features

Learning Spatial-Preserved Skeleton Representations for Few-Shot Action Recognition

Motion-Aware Mask Feature Reconstruction for Skeleton-Based Action Recognition

Fusing Shape and Motion Matrices for View Invariant Action Recognition Using 3D Skeletons

Joint Embedding with Multi-Task Learning for Multi-Label Zero-Shot Action Recognition

Multisource Learning for Skeleton-Based Action Recognition Using Deep LSTM and CNN

Semantic Embedding Space for Zero-Shot Action Recognition

Skeleton-based Attention-Aware Spatial-Temporal Model for Action Detection and Recognition

Skeleton MixFormer: Multivariate Topology Representation for Skeleton-based Action Recognition