Abstract: Self-supervised learning (SSL), which aims to learn meaningful prior representations from unlabeled data, has been proven effective for skeleton-based action understanding. Different from the image domain, skeleton data possesses sparser spatial structures and diverse representation forms, with the absence of background clues and the additional temporal dimension, presenting new challenges for spatial-temporal motion pretext task design. Recently, many endeavors have been made for skeleton-based SSL, achieving remarkable progress. However, a systematic and thorough review is still lacking. In this paper, we conduct, for the first time, a comprehensive survey on self-supervised skeleton-based action representation learning. Following the taxonomy of context-based, generative learning, and contrastive learning approaches, we make a thorough review and benchmark of existing works and shed light on possible future directions. Remarkably, our investigation demonstrates that most SSL works rely on a single paradigm, learn representations of a single level, and are evaluated solely on the action recognition task, which leaves the generalization power of skeleton SSL models under-explored. To this end, a novel and effective SSL method for skeleton is further proposed, which integrates versatile representation learning objectives of different granularity, substantially boosting the generalization capacity for multiple skeleton downstream tasks. Extensive experiments on three large-scale datasets demonstrate that our method achieves superior generalization performance on various downstream tasks, including recognition, retrieval, detection, and few-shot learning.

What problem does this paper attempt to address?

This paper addresses the challenges of applying self-supervised learning (SSL) to skeleton-based action understanding. Specifically, it focuses on the following aspects:

1. **Limitations of existing methods**: Most existing SSL methods rely on a single paradigm and learn representations at a single granularity (such as joint-level or sequence-level features), which limits how well the model generalizes across downstream tasks.
2. **Lack of a systematic review**: Although self-supervised learning on skeleton data has made significant progress in recent years, a systematic and comprehensive literature review and analysis is still missing.
3. **New challenges**: Unlike image data, skeleton data has a sparser spatial structure, diverse representation forms, an additional temporal dimension, and no background cues. These characteristics make it harder to design effective spatio-temporal motion pretext tasks.

To address these problems, the paper makes the following main contributions:

- **Provide a comprehensive review for the first time**: Following a taxonomy of context-based, generative learning, and contrastive learning approaches, it analyzes existing self-supervised skeleton-based action representation learning in detail, emphasizing the designs specific to skeleton data and the challenges they face.
- **Propose a new self-supervised learning method**: The method combines contrastive learning and masked skeleton modeling (MSM) to jointly learn representations at different granularities (joint-level, segment-level, and sequence-level), thereby significantly improving the model's generalization across multiple downstream tasks.
- **Provide extensive benchmark tests**: Beyond summarizing existing skeleton SSL work, it discusses popular datasets and downstream tasks in detail, offers in-depth insights from the perspectives of model backbones, pre-training paradigms, and so on, and demonstrates the superior performance of the proposed method on five downstream tasks.

Through these contributions, the paper aims to advance research on self-supervised skeleton-based action representation learning and to provide insights for future work.
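The idea of combining contrastive learning with masked skeleton modeling can be illustrated with a small sketch. The summary does not state the paper's actual loss functions, so everything here is an assumption made for illustration: an InfoNCE-style loss stands in for the contrastive objective, a mean squared error over masked joints stands in for MSM, and the function names and the `weight` parameter are hypothetical.

```python
import math


def info_nce(queries, keys, temperature=0.1):
    """InfoNCE-style loss (assumed stand-in for the contrastive objective).

    queries[i] and keys[i] are embeddings of two views of the same sequence;
    every other key in the batch acts as a negative.
    """
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity to every key, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp over all keys.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)


def masked_reconstruction_loss(target, predicted, mask):
    """Mean squared error over masked joints only (assumed stand-in for MSM).

    target and predicted are indexed as [frame][joint][coordinate];
    mask[frame][joint] is True where the joint was hidden from the encoder.
    """
    error, count = 0.0, 0
    for t, frame in enumerate(target):
        for j, joint in enumerate(frame):
            if mask[t][j]:
                error += sum((a - b) ** 2 for a, b in zip(joint, predicted[t][j]))
                count += len(joint)
    return error / count if count else 0.0


def joint_objective(contrastive, reconstruction, weight=1.0):
    """Weighted sum of the two losses; `weight` is a hypothetical hyperparameter."""
    return contrastive + weight * reconstruction
```

In this sketch the contrastive term acts on one embedding per sequence, while the reconstruction term acts on individual joints, which is one simple way to see how the two objectives target different granularities.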

Self-Supervised Skeleton-Based Action Representation Learning: A Benchmark and Beyond

Self-Supervised 3D Skeleton Representation Learning with Active Sampling and Adaptive Relabeling for Action Recognition

Skeleton-Contrastive 3D Action Representation Learning

Contrast-reconstruction Representation Learning for Self-supervised Skeleton-based Action Recognition

Self-supervised Action Representation Learning from Partial Spatio-Temporal Skeleton Sequences

MS²L: Multi-Task Self-Supervised Learning for Skeleton Based Action Recognition

Self-Supervised Action Representation Learning Based on Asymmetric Skeleton Data Augmentation

Unsupervised Representation Learning With Long-Term Dynamics for Skeleton Based Action Recognition

Cross-Stream Contrastive Learning for Self-Supervised Skeleton-Based Action Recognition

Representation modeling learning with multi-domain decoupling for unsupervised skeleton-based action recognition

Improving Self-Supervised Action Recognition from Extremely Augmented Skeleton Sequences

EMS2L: Enhanced Multi-Task Self-Supervised Learning for 3D Skeleton Representation Learning

Self-supervised visual learning from interactions with objects

Part Aware Contrastive Learning for Self-Supervised Action Recognition

Sparse Semi-Supervised Action Recognition with Active Learning

Actionlet-Dependent Contrastive Learning for Unsupervised Skeleton-Based Action Recognition

A Survey on 3D Skeleton-Based Action Recognition Using Learning Method

Balanced Representation Learning for Long-tailed Skeleton-based Action Recognition

Efficient Spatio-Temporal Contrastive Learning for Skeleton-Based 3D Action Recognition

Contrastive Learning from Extremely Augmented Skeleton Sequences for Self-Supervised Action Recognition