STARS: Self-supervised Tuning for 3D Action Recognition in Skeleton Sequences

Soroush Mehraban, Mohammad Javad Rajabi, Babak Taati
2024-07-16
Abstract: Self-supervised pretraining methods with masked prediction demonstrate remarkable within-dataset performance in skeleton-based action recognition. However, we show that, unlike contrastive learning approaches, they do not produce well-separated clusters. Additionally, these methods struggle with generalization in few-shot settings. To address these issues, we propose Self-supervised Tuning for 3D Action Recognition in Skeleton sequences (STARS). Specifically, STARS first uses a masked prediction stage using an encoder-decoder architecture. It then employs nearest-neighbor contrastive learning to partially tune the weights of the encoder, enhancing the formation of semantic clusters for different actions. By tuning the encoder for a few epochs, and without using hand-crafted data augmentations, STARS achieves state-of-the-art self-supervised results in various benchmarks, including NTU-60, NTU-120, and PKU-MMD. In addition, STARS exhibits significantly better results than masked prediction models in few-shot settings, where the model has not seen the actions throughout pretraining. Project page: https://soroushmehraban.github.io/stars/
Computer Vision and Pattern Recognition
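The masked prediction stage mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the 0.9 mask ratio, the 8-frame clip length, the mean-squared-error objective, and the zero-valued stand-in for the decoder output are all assumptions made for illustration. The only dataset-related detail used is that NTU skeletons have 25 joints.

```python
import numpy as np

def random_token_mask(num_tokens, mask_ratio, rng):
    """Return a boolean mask that is True for tokens hidden from the encoder."""
    num_masked = int(round(num_tokens * mask_ratio))
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.permutation(num_tokens)[:num_masked]] = True
    return mask

def masked_reconstruction_loss(prediction, target, mask):
    """Mean squared error computed only over the masked tokens."""
    diff = prediction[mask] - target[mask]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
frames, joints, channels = 8, 25, 3  # assumed clip length; 25 joints, xyz
skeleton = rng.normal(size=(frames * joints, channels))  # one token per joint per frame

# Hypothetical mask ratio; the value used in the paper is not stated here.
mask = random_token_mask(frames * joints, mask_ratio=0.9, rng=rng)

visible_tokens = skeleton[~mask]      # what an encoder would receive
prediction = np.zeros_like(skeleton)  # stand-in for a decoder's reconstruction
loss = masked_reconstruction_loss(prediction, skeleton, mask)
```

In a real model the encoder would embed `visible_tokens` and the decoder would fill in `prediction`; the sketch only shows how masking and a masked-only loss fit together.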
What problem does this paper attempt to address?
The paper "STARS: Self-supervised Tuning for 3D Action Recognition in Skeleton Sequences" addresses two main problems:

1. **Limitations of existing self-supervised pretraining methods**:
   - **Poor clustering**: Self-supervised pretraining methods based on masked prediction perform well in within-dataset evaluations, but the action-class clusters they produce are not well separated, so different actions are hard to tell apart in the learned feature space.
   - **Insufficient generalization**: These methods perform poorly in few-shot settings and do not generalize well to unseen action classes.
2. **Improving the performance and generalization of 3D skeleton-sequence action recognition**:
   - The paper proposes a new framework, STARS (Self-supervised Tuning for 3D Action Recognition in Skeleton Sequences), which combines a Masked Autoencoder (MAE) with contrastive learning to improve self-supervised pretraining.
   - STARS first runs a masked prediction stage and then tunes the encoder weights with Nearest-Neighbor Contrastive Learning (NNCLR) to strengthen the semantic clustering of different actions, improving the model's generalization and recognition accuracy.

### Specific problems and solutions

- **Problem**: Existing self-supervised methods perform poorly in few-shot learning scenarios, and the action-class clusters they produce are not well separated.
- **Solution**: STARS adds a contrastive learning stage that tunes the encoder, so the model recognizes unseen actions better in few-shot scenarios and forms clearer action-class clusters.

### Main contributions

1. **The STARS framework**: A sequential self-supervised framework that significantly improves action recognition performance and generalization by following MAE pretraining with contrastive tuning.
2. **Better few-shot performance**: MAE methods perform well in within-dataset evaluations but poorly in few-shot scenarios. By adding contrastive learning, STARS significantly improves few-shot performance while keeping the within-dataset advantages of MAE.
3. **Experimental validation**: Experiments on multiple large-scale 3D skeleton action recognition datasets show that STARS achieves state-of-the-art performance in most cases.

### Summary

The paper addresses the weak clustering and limited generalization of existing self-supervised pretraining methods by combining a Masked Autoencoder with contrastive learning, thereby improving 3D skeleton-sequence action recognition.
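
The nearest-neighbor contrastive stage can be sketched with the generic NNCLR formulation: each embedding is replaced by its nearest neighbor from a support set, and that neighbor is contrasted against the batch with an InfoNCE-style loss. The code below is a minimal NumPy sketch under those assumptions, not the authors' code. The function names, the temperature of 0.1, and the toy inputs are hypothetical, and the sketch does not cover how STARS builds its two views or which encoder weights it tunes.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def nearest_neighbors(queries, support):
    """For each query row, return the support row with highest cosine similarity."""
    sims = l2_normalize(queries) @ l2_normalize(support).T
    return support[np.argmax(sims, axis=1)]

def nnclr_loss(z_a, z_b, support, temperature=0.1):
    """InfoNCE loss where the nearest neighbor of z_a[i] must match z_b[i]."""
    nn_a = l2_normalize(nearest_neighbors(z_a, support))
    z_b = l2_normalize(z_b)
    logits = nn_a @ z_b.T / temperature                   # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # positives on the diagonal

# Toy example: four perfectly separated "action" embeddings.
support = np.eye(4)
z_a = np.eye(4) + 0.01   # slightly perturbed view
z_b = np.eye(4)          # matching view
aligned = nnclr_loss(z_a, z_b, support)
shuffled = nnclr_loss(z_a, np.roll(z_b, 1, axis=0), support)
```

With matching pairs the loss is close to zero, while shuffling the second view makes every positive pair wrong and the loss large. That is the pressure pulling embeddings of the same action together and pushing different actions apart.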