Abstract: Skeleton-based action recognition has attracted much attention, benefiting from its succinctness and robustness. However, the minimal inter-class variation in similar action sequences often leads to confusion. The inherent spatiotemporal coupling characteristics make it challenging to mine the subtle differences in joint motion trajectories, which is critical for distinguishing confusing fine-grained actions. To alleviate this problem, we propose a Wavelet-Attention Decoupling (WAD) module that utilizes discrete wavelet transform to effectively disentangle salient and subtle motion features in the time-frequency domain. Then, the decoupling attention adaptively recalibrates their temporal responses. To further amplify the discrepancies in these subtle motion features, we propose a Fine-grained Contrastive Enhancement (FCE) module to enhance attention towards trajectory features by contrastive learning. Extensive experiments are conducted on the coarse-grained dataset NTU RGB+D and the fine-grained dataset FineGYM. Our methods perform competitively compared to state-of-the-art methods and can discriminate confusing fine-grained actions well.

What problem does this paper attempt to address?

The paper addresses the confusion that arises in fine-grained action recognition when similar action sequences differ only minimally between classes. Because the spatial and temporal information in skeleton sequences is inherently coupled, it is difficult to mine the subtle differences in joint motion trajectories, and these differences are crucial for distinguishing similar actions. To solve this problem, the authors propose a Wavelet-Attention Decoupling (WAD) module based on the wavelet transform and a Fine-grained Contrastive Enhancement (FCE) module, which decouple the salient and subtle motion features and strengthen the discriminative ability of the subtle features through contrastive learning.

### Main contributions of the paper

1. **Wavelet-Attention Decoupling (WAD) module**:
   - Utilize the Discrete Wavelet Transform (DWT) to decouple the salient and subtle motion features in the time-frequency domain.
   - Adaptively recalibrate the temporal responses of the decoupled features through a parameterized decoupling attention mechanism.
2. **Fine-grained Contrastive Enhancement (FCE) module**:
   - Amplify the differences between the subtle motion features by capturing the correlations between trajectory features.
   - Use the Prototype Contrastive Loss to guide the learning of trajectory attention and improve the separation between action categories.
3. **Experimental verification**:
   - Extensive experiments on the NTU RGB+D and FineGYM datasets show that the method is competitive with state-of-the-art methods and is particularly effective at distinguishing easily confused fine-grained actions.

### Method overview

1. **Feature extraction**:
   - Use a feature extraction backbone containing 3 ST-GC layers and 6 SSA-Tformer layers to map the skeleton features to the embedding space.
2. **Wavelet-Attention Decoupling (WAD) module**:
   - Reshape the feature map \(X_{\text{embed}}\) to \(X_{\text{embed}}\in\mathbb{R}^{N\times V\times C\times T}\), where \(N\) is the batch size, \(V\) is the number of joints, \(C\) is the channel dimension, and \(T\) is the number of frames in the skeleton sequence.
   - Apply a 1D Discrete Wavelet Transform (DWT) to decompose the features into a low-frequency component \(X_{\text{low}}\) and a high-frequency component \(X_{\text{high}}\).
   - Adaptively recalibrate the temporal responses of the low-frequency and high-frequency components through the decoupling attention mechanism to obtain the decoupled salient feature \(X_{\text{salient}}\) and subtle feature \(X_{\text{subtle}}\).
3. **Fine-grained Contrastive Enhancement (FCE) module**:
   - Design a Trajectory-wise Attention block to amplify the differences between the trajectory features.
   - Use the Prototype Contrastive Loss to guide the learning of trajectory attention and enhance the discriminative ability of the subtle features.
4. **Feature fusion and training objective**:
   - Add the salient feature and the subtle feature together to obtain the final fused feature \(X_{\text{fuse}}\).
   - Train \(X_{\text{fuse}}\) and \(X_{\text{salient}}\) with cross-entropy loss, and train \(X_{\text{subtle}}\) with the prototype contrastive loss.

### Experimental results

- In NTU
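To make the DWT step of the WAD module in the method overview more concrete, the following is a minimal NumPy sketch of a single-level 1D wavelet decomposition along the temporal axis of a feature map shaped \(N\times V\times C\times T\). The summary does not say which wavelet the authors use, so the Haar wavelet here is an assumption, and the function names `haar_dwt_1d` and `haar_idwt_1d` as well as the tensor sizes are hypothetical. The decoupling attention itself is not reproduced.

```python
import numpy as np


def haar_dwt_1d(x):
    """Single-level Haar DWT along the last (temporal) axis.

    x has shape (..., T) with even T. Returns (low, high), each of
    shape (..., T // 2). The Haar wavelet is an assumption of this
    sketch, not a detail taken from the paper.
    """
    if x.shape[-1] % 2 != 0:
        raise ValueError("this sketch requires an even number of frames")
    even = x[..., 0::2]  # frames 0, 2, 4, ...
    odd = x[..., 1::2]   # frames 1, 3, 5, ...
    low = (even + odd) / np.sqrt(2.0)   # pairwise averages: slow motion
    high = (even - odd) / np.sqrt(2.0)  # pairwise differences: fast changes
    return low, high


def haar_idwt_1d(low, high):
    """Inverse of haar_dwt_1d: interleave the reconstructed frames."""
    even = (low + high) / np.sqrt(2.0)
    odd = (low - high) / np.sqrt(2.0)
    out = np.empty(low.shape[:-1] + (2 * low.shape[-1],), dtype=low.dtype)
    out[..., 0::2] = even
    out[..., 1::2] = odd
    return out


# Hypothetical sizes: N=2 sequences, V=25 joints, C=8 channels, T=16 frames.
rng = np.random.default_rng(0)
x_embed = rng.standard_normal((2, 25, 8, 16))

x_low, x_high = haar_dwt_1d(x_embed)   # each has shape (2, 25, 8, 8)
x_rec = haar_idwt_1d(x_low, x_high)    # recovers x_embed up to rounding
```

Because the transform is orthonormal, the two components together preserve the signal energy and can be inverted exactly, which suggests that splitting into \(X_{\text{low}}\) and \(X_{\text{high}}\) need not discard information before any attention is applied. A library such as PyWavelets offers other wavelet families if Haar is too coarse.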

Wavelet-Decoupling Contrastive Enhancement Network for Fine-Grained Skeleton-Based Action Recognition

Multidimensional Refinement Graph Convolutional Network With Robust Decouple Loss for Fine-Grained Skeleton-Based Action Recognition

Multi-Dimensional Refinement Graph Convolutional Network with Robust Decouple Loss for Fine-Grained Skeleton-Based Action Recognition

An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition

Dual-Excitation Spatial–Temporal Graph Convolution Network for Skeleton-Based Action Recognition

An improved spatial temporal graph convolutional network for robust skeleton-based action recognition

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

SpatioTemporal Focus for Skeleton-based Action Recognition

Multiple temporal scale aggregation graph convolutional network for skeleton-based action recognition

Learning Discriminative Representations for Skeleton Based Action Recognition

A Tri-Attention Enhanced Graph Convolutional Network for Skeleton-Based Action Recognition

Motion Complement and Temporal Multifocusing for Skeleton-Based Action Recognition

Part Aware Contrastive Learning for Self-Supervised Action Recognition

Spatial Temporal Graph Attention Network for Skeleton-Based Action Recognition

Prompt-supervised dynamic attention graph convolutional network for skeleton-based action recognition

Multi‐temporal scale aggregation refinement graph convolutional network for skeleton‐based action recognition

Richly Activated Graph Convolutional Network for Robust Skeleton-Based Action Recognition

Spatiotemporal Decouple-and-Squeeze Contrastive Learning for Semi-Supervised Skeleton-based Action Recognition

Focusing and Diffusion: Bidirectional Attentive Graph Convolutional Networks for Skeleton-based Action Recognition

Multi-Scale Adaptive Graph Convolution Network for Skeleton-Based Action Recognition

Channel-Wise Dense Connection Graph Convolutional Network for Skeleton-Based Action Recognition