Self-Supervised Skeleton-Based Action Representation Learning: A Benchmark and Beyond

Jiahang Zhang, Lilang Lin, Shuai Yang, Jiaying Liu
2024-08-26
Abstract: Self-supervised learning (SSL), which aims to learn meaningful prior representations from unlabeled data, has been proven effective for skeleton-based action understanding. Unlike data in the image domain, skeleton data has sparser spatial structures and diverse representation forms, lacks background clues, and adds a temporal dimension, all of which present new challenges for spatial-temporal motion pretext task design. Recently, many endeavors have been made for skeleton-based SSL, achieving remarkable progress. However, a systematic and thorough review is still lacking. In this paper, we conduct, for the first time, a comprehensive survey on self-supervised skeleton-based action representation learning. Following the taxonomy of context-based, generative learning, and contrastive learning approaches, we make a thorough review and benchmark of existing works and shed light on possible future directions. Remarkably, our investigation demonstrates that most SSL works rely on a single paradigm, learn representations of a single level, and are evaluated solely on the action recognition task, which leaves the generalization power of skeleton SSL models under-explored. To this end, a novel and effective SSL method for skeleton data is further proposed, which integrates versatile representation learning objectives of different granularity, substantially boosting the generalization capacity for multiple skeleton downstream tasks. Extensive experiments on three large-scale datasets demonstrate that our method achieves superior generalization performance on various downstream tasks, including recognition, retrieval, detection, and few-shot learning.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the challenges of applying self-supervised learning (SSL) to skeleton-based action understanding. Specifically, it focuses on the following aspects:

1. **Limitations of existing methods**: Most existing SSL methods rely on a single paradigm and learn representations of a single granularity (such as joint-level or sequence-level features), which limits how well the model generalizes to other downstream tasks.
2. **Lack of systematic review**: Although self-supervised learning on skeleton data has made significant progress in recent years, a systematic and comprehensive literature review and analysis is still missing.
3. **New challenges**: Unlike image data, skeleton data has a sparser spatial structure, an additional temporal dimension, and no background cues. These characteristics make it harder to design effective spatio-temporal motion pretext tasks.

To address these problems, the paper makes the following main contributions:

- **Provide a comprehensive review for the first time**: Following a taxonomy of context-based, generative learning, and contrastive learning approaches, the paper analyzes existing work on self-supervised skeleton-based action representation learning in detail, emphasizing the designs specific to skeleton data and the challenges they face.
- **Propose a new self-supervised learning method**: The method combines contrastive learning and masked skeleton modeling (MSM) to jointly learn representations of different granularities (joint-level, segment-level, and sequence-level), thereby significantly improving the model's generalization across multiple downstream tasks.
- **Provide extensive benchmark tests**: The paper not only summarizes existing skeleton SSL work, but also discusses popular datasets and downstream tasks in detail and offers in-depth insights from the perspectives of model backbones, pre-training paradigms, and so on, demonstrating the superior performance of the proposed method on five downstream tasks.

Through these contributions, the paper aims to advance research on self-supervised skeleton-based action representation learning and to provide rich insights for future work.
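
To make the idea of combining contrastive learning with masked skeleton modeling more concrete, the following is a minimal sketch of a joint objective. It is not the authors' implementation: the function names (`info_nce`, `masked_joint_mse`, `joint_objective`), the InfoNCE form of the contrastive term, the mean-squared-error reconstruction term, the temperature, and the `msm_weight` parameter are all assumptions chosen for illustration, and the segment-level objective mentioned above is omitted.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors given as lists."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.1):
    """Sequence-level contrastive term (assumed InfoNCE form).

    queries[i] and keys[i] are embeddings of two augmented views of the
    same skeleton sequence; all other keys serve as negatives.
    """
    queries = [normalize(q) for q in queries]
    keys = [normalize(k) for k in keys]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


def masked_joint_mse(target, predicted, mask):
    """Joint-level reconstruction term (assumed MSE over masked joints only).

    target, predicted: nested lists shaped [frames][joints][coords].
    mask: nested list shaped [frames][joints]; True marks a masked joint.
    """
    error, count = 0.0, 0
    for t_frame, p_frame, m_frame in zip(target, predicted, mask):
        for t_joint, p_joint, masked in zip(t_frame, p_frame, m_frame):
            if masked:
                error += sum((a - b) ** 2 for a, b in zip(t_joint, p_joint))
                count += len(t_joint)
    return error / count if count else 0.0


def joint_objective(queries, keys, target, predicted, mask, msm_weight=1.0):
    """Weighted sum of the two terms; msm_weight is a hypothetical knob."""
    return info_nce(queries, keys) + msm_weight * masked_joint_mse(target, predicted, mask)
```

In this sketch the contrastive term acts on whole-sequence embeddings while the reconstruction term acts on individual joints, which is one simple way to expose a model to more than one granularity during pre-training.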