Wavelet-Decoupling Contrastive Enhancement Network for Fine-Grained Skeleton-Based Action Recognition

Haochen Chang, Jing Chen, Yilin Li, Jixiang Chen, Xiaofeng Zhang
DOI: https://doi.org/10.1109/ICASSP48485.2024.10448199
2024-02-04
Abstract: Skeleton-based action recognition has attracted much attention, benefiting from its succinctness and robustness. However, the minimal inter-class variation in similar action sequences often leads to confusion. The inherent spatiotemporal coupling characteristics make it challenging to mine the subtle differences in joint motion trajectories, which is critical for distinguishing confusing fine-grained actions. To alleviate this problem, we propose a Wavelet-Attention Decoupling (WAD) module that utilizes discrete wavelet transform to effectively disentangle salient and subtle motion features in the time-frequency domain. Then, the decoupling attention adaptively recalibrates their temporal responses. To further amplify the discrepancies in these subtle motion features, we propose a Fine-grained Contrastive Enhancement (FCE) module to enhance attention towards trajectory features by contrastive learning. Extensive experiments are conducted on the coarse-grained dataset NTU RGB+D and the fine-grained dataset FineGYM. Our methods perform competitively compared to state-of-the-art methods and can discriminate confusing fine-grained actions well.
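The wavelet step described in the abstract can be illustrated with a minimal sketch. The text here does not name the wavelet family, so the single-level Haar wavelet used below is an assumption, as are the function names `haar_dwt_1d` and `haar_idwt_1d` and the toy trajectory. This is not the authors' implementation and it omits the decoupling attention. It only shows how a 1D DWT along the time axis splits one joint coordinate's trajectory into a smooth low-frequency part and a high-frequency part that carries the small frame-to-frame changes.

```python
import math


def haar_dwt_1d(signal):
    """Single-level 1D Haar DWT along time.

    Returns (low, high): approximation and detail coefficients,
    each half the (even-padded) length of the input.
    """
    signal = list(signal)
    if len(signal) % 2 != 0:
        # Assumption: pad by repeating the last frame so the length is even.
        signal.append(signal[-1])
    scale = 1.0 / math.sqrt(2.0)
    # Low-pass: scaled sum of neighbouring frames (coarse motion).
    low = [(signal[i] + signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    # High-pass: scaled difference of neighbouring frames (fine motion).
    high = [(signal[i] - signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    return low, high


def haar_idwt_1d(low, high):
    """Inverse of haar_dwt_1d: rebuilds the (even-length) signal."""
    scale = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(low, high):
        out.append((a + d) * scale)
        out.append((a - d) * scale)
    return out


# Hypothetical x-coordinate of one joint over T = 8 frames.
trajectory = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 6.5]
low, high = haar_dwt_1d(trajectory)
reconstructed = haar_idwt_1d(low, high)
```

Because the Haar transform is orthogonal, `haar_idwt_1d(low, high)` recovers the original even-length trajectory, so the two branches can be processed separately without discarding information.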
Computer Vision and Pattern Recognition, Multimedia
What problem does this paper attempt to address?
This paper addresses the confusion caused by minimal inter-class variation between similar action sequences in fine-grained skeleton-based action recognition. Because spatial and temporal information are inherently coupled in skeleton sequences, it is difficult to mine the subtle differences in joint motion trajectories, yet these differences are crucial for distinguishing similar actions. To address this, the authors propose a Wavelet-Attention Decoupling (WAD) module based on the discrete wavelet transform and a Fine-grained Contrastive Enhancement (FCE) module, which together decouple the salient and subtle motion features and strengthen the discriminative ability of the subtle features through contrastive learning.

### Main contributions of the paper

1. **Wavelet-Attention Decoupling (WAD) module**:
   - Uses the Discrete Wavelet Transform (DWT) to decouple the salient and subtle motion features in the time-frequency domain.
   - Adaptively recalibrates the temporal responses of the decoupled features through a parameterized decoupling attention mechanism.
2. **Fine-grained Contrastive Enhancement (FCE) module**:
   - Amplifies the differences between subtle motion features by capturing the correlations between trajectory features.
   - Uses the Prototype Contrastive Loss to guide the learning of trajectory attention and to improve discrimination between action categories.
3. **Experimental verification**:
   - Extensive experiments were carried out on the NTU RGB+D and FineGYM datasets. The results show that the method is competitive with state-of-the-art methods and is particularly effective at distinguishing easily confused fine-grained actions.

### Method overview

1. **Feature extraction**:
   - A backbone network containing 3 ST-GC layers and 6 SSA-Tformer layers maps the skeleton features to the embedding space.
2. **Wavelet-Attention Decoupling (WAD) module**:
   - Reshape the feature map \(X_{\text{embed}}\) to \(X_{\text{embed}}\in\mathbb{R}^{N\times V\times C\times T}\), where \(N\) is the batch size, \(V\) is the number of joints, \(C\) is the channel dimension, and \(T\) is the number of frames in the skeleton sequence.
   - Apply a 1D Discrete Wavelet Transform (DWT) to decompose the features into a low-frequency component \(X_{\text{low}}\) and a high-frequency component \(X_{\text{high}}\).
   - Adaptively recalibrate the temporal responses of the low-frequency and high-frequency components through the decoupling attention mechanism to obtain the decoupled salient feature \(X_{\text{salient}}\) and subtle feature \(X_{\text{subtle}}\).
3. **Fine-grained Contrastive Enhancement (FCE) module**:
   - Design a Trajectory-wise Attention block to amplify the differences between trajectory features.
   - Use the Prototype Contrastive Loss to guide the learning of trajectory attention and to improve the discriminative ability of the subtle features.
4. **Feature fusion and training objective**:
   - Add the salient feature and the subtle feature together to obtain the final fused feature \(X_{\text{fuse}}\).
   - Supervise \(X_{\text{fuse}}\) and \(X_{\text{salient}}\) with cross-entropy loss, and supervise \(X_{\text{subtle}}\) with the prototype contrastive loss.

### Experimental results

- In NTU