Abstract: To learn from the large amount of unlabeled data available for smart infrastructure, we propose Enhanced Multi-Task Self-Supervised Learning (EMS2L) for self-supervised action recognition based on 3D human skeletons. EMS2L integrates multiple self-supervised tasks to learn more comprehensive information, unlike previous methods that rely on a single self-supervised task. The self-supervised tasks employed here include task-specific methods (i.e., motion prediction and the jigsaw puzzle task) and task-agnostic methods such as contrastive learning. Combining these three self-supervised tasks lets us learn rich feature representations. Specifically, motion prediction extracts detailed information by reconstructing the original data from temporally masked and noisy sequences. The jigsaw puzzle task enables the model to explore temporally discriminative features for human action recognition by predicting the correct order of shuffled sequences. In addition, to regularize the feature space, we use contrastive learning to constrain feature learning, increasing compactness within each class and separability between classes. To learn invariant representations, we propose an attention model for contrastive representation learning that reduces the distance between original features and attention features. To avoid degrading the learned representation by pursuing excessive invariance, this attention-based contrastive learning assigns different weights to the features of differently transformed data. We evaluate EMS2L on downstream tasks under a variety of settings, including fully-supervised, semi-supervised, unsupervised, and transfer learning. We also explore different network architectures (i.e., GRU and GCN). The strong results on the NW-UCLA, NTU RGB+D, and PKU-MMD datasets illustrate the generality of our approach.
Extensive experiments demonstrate the advantage of our method: it learns features that are more general and more discriminative. We also provide further experimental analysis of the different self-supervised tasks.
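As an illustration of the three pretext tasks named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes a skeleton sequence is a list of frames, each a flat list of joint coordinates. The function names (`mask_and_noise`, `make_jigsaw`, `info_nce`), the mask ratio, noise level, number of segments, and the InfoNCE-style loss are all hypothetical choices made for this example; the attention-based weighting described above is not modeled.

```python
import itertools
import math
import random


def mask_and_noise(seq, mask_ratio=0.25, noise_std=0.01, rng=None):
    """Corrupt a sequence for motion prediction.

    Randomly chosen frames are zeroed out (temporal masking) and the
    remaining frames receive Gaussian jitter. A model would be trained
    to reconstruct the original `seq` from the returned sequence.
    """
    rng = rng or random.Random()
    n_mask = int(len(seq) * mask_ratio)
    masked = set(rng.sample(range(len(seq)), n_mask))
    corrupted = []
    for t, frame in enumerate(seq):
        if t in masked:
            corrupted.append([0.0] * len(frame))
        else:
            corrupted.append([x + rng.gauss(0.0, noise_std) for x in frame])
    return corrupted, sorted(masked)


def make_jigsaw(seq, num_segments=3, rng=None):
    """Shuffle equal-length temporal segments for the jigsaw task.

    Returns the shuffled sequence and the index of the applied
    permutation, which serves as the classification label.
    Trailing frames that do not fill a whole segment are dropped.
    """
    rng = rng or random.Random()
    seg_len = len(seq) // num_segments
    segments = [seq[i * seg_len:(i + 1) * seg_len] for i in range(num_segments)]
    perms = list(itertools.permutations(range(num_segments)))
    label = rng.randrange(len(perms))
    shuffled = [frame for i in perms[label] for frame in segments[i]]
    return shuffled, label


def _cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss for one anchor feature.

    The loss is small when the anchor is closer to its positive
    (e.g. a transformed view of the same sequence) than to negatives.
    """
    logits = [_cosine(anchor, positive)] + [_cosine(anchor, n) for n in negatives]
    logits = [value / temperature for value in logits]
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(value - peak) for value in logits))
    return log_sum - logits[0]


if __name__ == "__main__":
    rng = random.Random(0)
    toy_seq = [[float(t), float(t)] for t in range(12)]  # 12 frames, 2 values each
    corrupted, masked = mask_and_noise(toy_seq, rng=rng)
    shuffled, label = make_jigsaw(toy_seq, rng=rng)
    print("masked frames:", masked)
    print("jigsaw label:", label)
    print("contrastive loss:", info_nce([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0]]))
```

In a multi-task setup of this kind, the reconstruction, permutation-classification, and contrastive losses would typically be summed with task weights on top of a shared encoder; how EMS2L weights them is not stated in the abstract.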

EMS2L: Enhanced Multi-Task Self-Supervised Learning for 3D Skeleton Representation Learning
