Abstract: Zero-shot skeleton-based action recognition aims to recognize actions of unseen categories after training on data of seen categories. The key is to build the connection between visual and semantic space from seen to unseen classes. Previous studies have primarily focused on encoding sequences into a singular feature vector and subsequently mapping the features to an identical anchor point within the embedding space. Their performance is hindered by 1) neglect of the global visual/semantic distribution alignment, which limits their ability to capture the true interdependence between the two spaces; and 2) neglect of temporal information, since the frame-wise features with rich action clues are directly pooled into a single feature vector. We propose a new zero-shot skeleton-based action recognition method via mutual information (MI) estimation and maximization. Specifically, 1) we maximize the MI between visual and semantic space for distribution alignment; 2) we leverage the temporal information for estimating the MI by encouraging MI to increase as more frames are observed. Extensive experiments on three large-scale skeleton action datasets confirm the effectiveness of our method. Code: https://github.com/YujieOuO/SMIE
What problem does this paper attempt to address?
This paper addresses zero-shot skeleton-based action recognition: the model is trained only on data from seen classes and must recognize actions from unseen classes at test time. Specifically, the paper aims to connect the visual space and the semantic space so that knowledge transfers from seen classes to unseen classes.
### Main Problems and Challenges
1. **Neglect of Global Distribution Alignment**: Previous methods mainly encode each action sequence into a single feature vector and map it to one anchor point in the embedding space. This point-wise mapping ignores the global distribution alignment between the visual and semantic spaces, making it difficult to capture the true interdependence between the two.
2. **Neglect of Temporal Information**: Frame-level features contain rich action cues, but they are pooled directly into a single feature vector. The pooling discards temporal information that is useful for action recognition.
### Solutions
To solve the above problems, the paper proposes a new method (SMIE) based on mutual information estimation and maximization, which specifically includes two modules:
1. **Global Alignment Module**:
- Align the distributions of the visual and semantic spaces by maximizing the mutual information \( I(V; A) \).
- Use the Jensen-Shannon divergence (JSD) as the mutual information estimator: paired visual and semantic features (samples from the joint distribution) are pushed toward high scores, while unpaired features (samples from the product of marginals) are pushed toward low scores.
- The definition of mutual information is as follows:
\[
I(V; A) = D_{KL}\big(p(v, a) \,\|\, p(v)p(a)\big) = \mathbb{E}_{p(v, a)} \left[ \log \frac{p(v|a)}{p(v)} \right]
\]
- Here \( D_{KL} \) is the Kullback-Leibler divergence, \( p(v, a) \) is the joint distribution, and \( p(v)p(a) \) is the product of the marginal distributions.
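
To make the JSD-based objective concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the commonly used JSD form \( \mathbb{E}_{p(v,a)}[-\mathrm{sp}(-T(v,a))] - \mathbb{E}_{p(v)p(a)}[\mathrm{sp}(T(v,a))] \) with \( \mathrm{sp}(z) = \log(1 + e^z) \). The names `jsd_mi_estimate` and `bilinear_scores`, and the bilinear critic \( T(v, a) = v^\top W a \), are hypothetical choices made for illustration; the summary above does not specify the critic.

```python
import numpy as np

def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return np.logaddexp(0.0, z)

def bilinear_scores(v, a, w):
    # Hypothetical critic T(v, a) = v^T W a (an assumption for this sketch).
    # v: (N, Dv) visual features, a: (N, Da) semantic features, w: (Dv, Da).
    return v @ w @ a.T  # (N, N); entry [i, j] scores visual i against semantic j

def jsd_mi_estimate(scores):
    # scores[i, j] = T(v_i, a_j). Diagonal entries are paired samples (joint),
    # off-diagonal entries are unpaired samples (product of marginals).
    n = scores.shape[0]  # requires n >= 2 so that unpaired samples exist
    paired = np.eye(n, dtype=bool)
    pos = -softplus(-scores[paired]).mean()  # large when paired scores are high
    neg = softplus(scores[~paired]).mean()   # small when unpaired scores are low
    return pos - neg

# Usage: random features and an identity critic matrix, for illustration only.
rng = np.random.default_rng(0)
a = rng.normal(size=(8, 4))
v = a + 0.1 * rng.normal(size=(8, 4))  # visual features correlated with semantics
estimate = jsd_mi_estimate(bilinear_scores(v, a, np.eye(4)))
```

In this form the objective is at most 0 and equals \( -2\log 2 \) when the critic cannot tell paired from unpaired samples, so training the critic and encoders to increase it rewards features whose pairing is identifiable.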
2. **Temporal Constraint Module**:
- Use temporal information to estimate mutual information and encourage the mutual information to gradually increase as more frames are observed.
- Propose a bidirectional motion attention mechanism that emphasizes key frames. First compute the element-wise bidirectional motion \( p_{k,j,c} \) of frame \( k \):
\[
p_{k,j,c}=(p_{k,j,c}^{\text{next}})^2+(p_{k,j,c}^{\text{pre}})^2
\]
- Calculate the average motion value of each frame:
\[
p_k = \frac{1}{J \times C} \sum_{j = 1}^{J} \sum_{c = 1}^{C} p_{k,j,c}
\]
- Normalize over all \( K \) frames to obtain the motion rate of each frame, which serves as its bidirectional attention weight:
\[
q_k=\frac{p_k}{\sum_{i = 1}^{K} p_i}
\]
- Select the top \( P \) frames with the highest attention scores as key frames and construct an attention-masked sample sequence.
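
The three attention formulas above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the input is assumed to have shape `(K, J, C)` (frames, joints, coordinate channels), the "next" and "pre" terms are assumed to be differences with the following and preceding frame, and boundary frames are assumed to have zero motion on their missing side. The function name `motion_attention` is hypothetical. How the selected key frames are used to build the masked sequence is not specified in this summary, so the sketch stops at key-frame selection.

```python
import numpy as np

def motion_attention(x, top_p):
    # x: skeleton sequence of shape (K, J, C); top_p: number of key frames (>= 1).
    x = np.asarray(x, dtype=float)
    diff = x[1:] - x[:-1]          # (K-1, J, C) consecutive-frame differences

    nxt = np.zeros_like(x)         # assumed motion toward the next frame
    nxt[:-1] = diff
    pre = np.zeros_like(x)         # assumed motion from the previous frame
    pre[1:] = diff

    p_kjc = nxt ** 2 + pre ** 2    # element-wise bidirectional motion p_{k,j,c}
    p_k = p_kjc.mean(axis=(1, 2))  # average over J and C: p_k

    total = p_k.sum()
    if total > 0:
        q = p_k / total            # attention weights q_k, summing to 1
    else:
        q = np.full(len(p_k), 1.0 / len(p_k))  # static sequence: uniform weights

    key_frames = np.sort(np.argsort(q)[-top_p:])  # top-P frames, in temporal order
    return q, key_frames

# Usage: a 5-frame sequence that is static except for a jump between frames 2 and 3.
seq = np.zeros((5, 2, 3))
seq[3:] = 1.0
weights, keys = motion_attention(seq, top_p=2)
```

Frames adjacent to the jump receive all of the attention mass in this example, which matches the intent of emphasizing frames with the most motion.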
### Summary
Through these two modules, the SMIE method not only aligns the global distributions of the visual and semantic spaces but also makes full use of the temporal dynamic information of actions, thereby improving the performance of zero-shot skeleton-based action recognition.