Abstract: Self-supervised pretraining methods with masked prediction demonstrate remarkable within-dataset performance in skeleton-based action recognition. However, we show that, unlike contrastive learning approaches, they do not produce well-separated clusters. Additionally, these methods struggle with generalization in few-shot settings. To address these issues, we propose Self-supervised Tuning for 3D Action Recognition in Skeleton sequences (STARS). Specifically, STARS first uses a masked prediction stage using an encoder-decoder architecture. It then employs nearest-neighbor contrastive learning to partially tune the weights of the encoder, enhancing the formation of semantic clusters for different actions. By tuning the encoder for a few epochs, and without using hand-crafted data augmentations, STARS achieves state-of-the-art self-supervised results in various benchmarks, including NTU-60, NTU-120, and PKU-MMD. In addition, STARS exhibits significantly better results than masked prediction models in few-shot settings, where the model has not seen the actions throughout pretraining. Project page: https://soroushmehraban.github.io/stars/
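To make the few-shot setting concrete, here is a minimal sketch of how frozen encoder embeddings could be scored on action classes that were absent from pretraining. The abstract does not describe the evaluation protocol, so everything here is an assumption for illustration: the nearest-prototype rule, the cosine similarity, and the hypothetical function name `few_shot_accuracy` are not taken from the paper.

```python
import numpy as np


def few_shot_accuracy(support_x, support_y, query_x, query_y):
    """Nearest-prototype accuracy on frozen embeddings (illustrative only).

    support_x: (N, D) embeddings of the few labeled examples per unseen class
    support_y: (N,)   their class labels
    query_x:   (M, D) embeddings to classify
    query_y:   (M,)   ground-truth labels for the queries
    """
    support_x = np.asarray(support_x, dtype=float)
    query_x = np.asarray(query_x, dtype=float)
    support_y = np.asarray(support_y)
    query_y = np.asarray(query_y)

    def normalize(x):
        # Unit-normalize rows so a dot product equals cosine similarity.
        return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)

    classes = np.unique(support_y)
    # One prototype per class: the mean of that class's support embeddings.
    prototypes = np.stack(
        [support_x[support_y == c].mean(axis=0) for c in classes]
    )
    sims = normalize(query_x) @ normalize(prototypes).T  # shape (M, num_classes)
    predictions = classes[np.argmax(sims, axis=1)]
    return float(np.mean(predictions == query_y))
```

Under this assumed protocol, an encoder whose embeddings form well-separated clusters per action should score higher, because each query lies close to the prototype of its own class.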

What problem does this paper attempt to address?

The paper "STARS: Self-supervised Tuning for 3D Action Recognition in Skeleton Sequences" addresses two main problems:

1. **Limitations of existing self-supervised pretraining methods**:
   - **Poorly separated clusters**: Self-supervised pretraining methods based on masked prediction perform well in within-dataset evaluations, but, unlike contrastive learning approaches, the action-class clusters they produce are not well separated, which makes different actions harder to tell apart.
   - **Weak generalization**: These methods perform poorly in few-shot settings and do not generalize well to action classes that were not seen during pretraining.
2. **Improving the performance and generalization of 3D skeleton-sequence action recognition**:
   - The paper proposes a new framework, STARS (Self-supervised Tuning for 3D Action Recognition in Skeleton Sequences), which combines a masked autoencoder (MAE) with contrastive learning to improve self-supervised pretraining.
   - STARS first runs a masked prediction stage and then partially tunes the encoder weights with nearest-neighbor contrastive learning (NNCLR) to strengthen the semantic clustering of different actions, thereby improving the model's generalization and recognition accuracy.

### Specific problems and solutions

- **Problem**: Existing masked-prediction self-supervised methods perform poorly in few-shot scenarios, and the action-class clusters they produce are not well separated.
- **Solution**: STARS adds a contrastive learning stage that tunes the encoder, so the model recognizes unseen actions better in few-shot scenarios and forms clearer action-class clusters.

### Main contributions

1. **The STARS framework**: A sequential self-supervised framework that improves the performance and generalization of action recognition through MAE pretraining followed by contrastive tuning of the encoder.
2. **Better few-shot performance**: The MAE method performs well in within-dataset evaluations but poorly in few-shot scenarios. By adding contrastive learning, STARS significantly improves few-shot results while keeping the within-dataset advantages of the MAE method.
3. **Empirical validation**: Experiments on multiple large-scale 3D skeleton action recognition datasets (NTU-60, NTU-120, and PKU-MMD) show that STARS achieves state-of-the-art self-supervised performance in most cases.

### Summary

The main goal of this paper is to address the weak cluster separation and limited generalization of existing masked-prediction pretraining methods by combining a masked autoencoder with contrastive learning, thereby improving 3D skeleton-sequence action recognition.
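The nearest-neighbor contrastive tuning stage can be illustrated with a minimal NumPy sketch of an NNCLR-style loss. This is a sketch under stated assumptions, not the authors' implementation: the support queue of past embeddings, the temperature value, and the names `nearest_neighbors` and `nnclr_loss` are illustrative, and how STARS produces the two representations of each sequence without hand-crafted augmentations is not specified in the text above.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def nearest_neighbors(embeddings, queue):
    """For each embedding, return its most cosine-similar entry in the queue."""
    sims = l2_normalize(embeddings) @ l2_normalize(queue).T  # (B, Q)
    return queue[np.argmax(sims, axis=1)]


def nnclr_loss(embeddings, predictions, queue, temperature=0.1):
    """InfoNCE loss whose positives are nearest neighbors from the queue.

    embeddings:  (B, D) representations used to look up neighbors
    predictions: (B, D) second representation of the same B samples (assumed)
    queue:       (Q, D) support set of embeddings from earlier batches (assumed)
    """
    neighbors = l2_normalize(nearest_neighbors(embeddings, queue))
    preds = l2_normalize(predictions)
    logits = neighbors @ preds.T / temperature  # (B, B)
    # Row i's positive is column i; all other columns act as negatives.
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

Because the positive is a neighbor drawn from other samples rather than the sample itself, minimizing this loss pulls semantically similar sequences together, which matches the cluster-forming effect the summary attributes to the tuning stage.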

STARS: Self-supervised Tuning for 3D Action Recognition in Skeleton Sequences

Self-supervised Action Representation Learning from Partial Spatio-Temporal Skeleton Sequences

Masked Motion Predictors Are Strong 3D Action Representation Learners

Unveiling the Hidden Realm: Self-supervised Skeleton-based Action Recognition in Occluded Environments

Efficient Spatio-Temporal Contrastive Learning for Skeleton-Based 3D Action Recognition

Skeleton-Contrastive 3D Action Representation Learning

Exploring Self-Supervised Skeleton-Based Human Action Recognition under Occlusions

Improving Self-Supervised Action Recognition from Extremely Augmented Skeleton Sequences

Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence

Self-Supervised 3D Skeleton Representation Learning with Active Sampling and Adaptive Relabeling for Action Recognition

Augmented Skeleton Based Contrastive Action Learning with Momentum LSTM for Unsupervised Action Recognition

STAR-Net: Action Recognition using Spatio-Temporal Activation Reprojection

Learning Representations by Contrastive Spatio-temporal Clustering for Skeleton-based Action Recognition

MS²L: Multi-Task Self-Supervised Learning for Skeleton Based Action Recognition

MSST-RT: Multi-Stream Spatial-Temporal Relative Transformer for Skeleton-Based Action Recognition

Fine-Grained Side Information Guided Dual-Prompts for Zero-Shot Skeleton Action Recognition

Self-Supervised 3D Action Representation Learning With Skeleton Cloud Colorization

Skeleton-Based Action Recognition with Spatial Reasoning and Temporal Stack Learning

Spatial-Temporal Asynchronous Normalization for Unsupervised 3D Action Representation Learning

Temporal-masked skeleton-based action recognition with supervised contrastive learning

Part Aware Contrastive Learning for Self-Supervised Action Recognition