Abstract: To learn from the large amounts of unlabeled data available for smart infrastructure, we propose Enhanced Multi-Task Self-Supervised Learning (EMS2L) for self-supervised action recognition based on 3D human skeletons. EMS2L integrates multiple self-supervised tasks to learn more comprehensive information, unlike previous methods, which rely on a single self-supervised task. The self-supervised tasks employed here include task-specific methods (i.e., motion prediction and the jigsaw puzzle task) and task-agnostic methods such as contrastive learning. Combining these three self-supervised tasks lets us learn rich feature representations. Specifically, motion prediction extracts detailed information by reconstructing the original data from temporally masked and noisy sequences. The jigsaw puzzle task enables the model to explore temporally discriminative features for human action recognition by predicting the correct order of shuffled sequences. In addition, to regularize the feature space, we use contrastive learning to constrain feature learning, increasing compactness within each class and separability between classes. To learn invariant representations, we propose an attention model for contrastive representation learning that reduces the distance between original features and attention features. To avoid degrading the network's representations by pursuing excessive invariance, this attention-based contrastive learning assigns different weights to the features of differently transformed data. We evaluate EMS2L on downstream tasks under a variety of settings, including fully-supervised, semi-supervised, unsupervised, and transfer learning. We also explore different network architectures (i.e., GRU and GCN). Strong results on the NW-UCLA, NTU RGB+D, and PKUMMD datasets illustrate the generality of our approach.
Extensive experiments demonstrate the advantage of our method, which learns features that are more general and discriminative. We also provide further experimental analysis of the individual self-supervised tasks.
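To make the two task-specific pretext tasks described in the abstract more concrete, the following is a minimal sketch of how their inputs and targets could be built from a skeleton sequence. It is not the authors' implementation: the function names `mask_and_noise` and `jigsaw`, the `(frames, joints, 3)` array layout, and the parameters `mask_ratio`, `noise_std`, and `n_segments` are all assumptions made for illustration.

```python
import itertools
import numpy as np


def mask_and_noise(seq, mask_ratio=0.2, noise_std=0.01, rng=None):
    """Corrupt a skeleton sequence for a motion-prediction-style task.

    seq is assumed to have shape (T, J, 3): frames, joints, xyz.
    Returns the corrupted sequence and the sorted indices of masked frames;
    the reconstruction target is the clean `seq` itself.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_frames = seq.shape[0]
    n_mask = int(round(mask_ratio * n_frames))
    masked_idx = rng.choice(n_frames, size=n_mask, replace=False)
    # Add Gaussian noise everywhere, then zero out the masked frames.
    corrupted = seq + rng.normal(0.0, noise_std, size=seq.shape)
    corrupted[masked_idx] = 0.0
    return corrupted, np.sort(masked_idx)


def jigsaw(seq, n_segments=3, rng=None):
    """Shuffle temporal segments for a jigsaw-puzzle-style task.

    Returns the shuffled sequence and an integer label that indexes the
    permutation applied; a classifier would be trained to predict it.
    """
    rng = np.random.default_rng() if rng is None else rng
    perms = list(itertools.permutations(range(n_segments)))
    label = int(rng.integers(len(perms)))
    segments = np.array_split(seq, n_segments, axis=0)
    shuffled = np.concatenate([segments[i] for i in perms[label]], axis=0)
    return shuffled, label
```

In this sketch, an encoder would receive `corrupted` and be trained to reconstruct `seq`, while a separate head would receive the encoding of `shuffled` and classify `label`. The contrastive branch mentioned in the abstract is not covered here.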

S3DS: Self-supervised Learning of 3D Skeletons from Single View Images

3D Articulated Skeleton Extraction Using a Single Consumer-Grade Depth Camera.

S3O: A Dual-Phase Approach for Reconstructing Dynamic Shape and Skeleton of Articulated Objects from Single Monocular Video

SDF-SRN: Learning Signed Distance 3D Object Reconstruction from Static Images

Unsupervised Articulated Skeleton Extraction from Point Set Sequences Captured by a Single Depth Camera

Semi-supervised Single-view 3D Reconstruction via Multi Shape Prior Fusion Strategy and Self-Attention

Human Mesh Recovery from Monocular Images via a Skeleton-disentangled Representation

gSDF: Geometry-Driven Signed Distance Functions for 3D Hand-Object Reconstruction

Self-Supervised 3D Mesh Reconstruction from Single Images

Self-Supervised Learning for Non-Rigid Registration Between Near-Isometric 3D Surfaces in Medical Imaging.

Model-based 3D Hand Reconstruction via Self-Supervised Learning

EMS2L: Enhanced Multi-Task Self-Supervised Learning for 3D Skeleton Representation Learning

Semi-Supervised Single-View 3D Reconstruction Via Prototype Shape Priors

S‐LASSIE: Structure and smoothness enhanced learning from sparse image ensemble for 3D articulated shape reconstruction

3D3M: 3D Modulated Morphable Model for Monocular Face Reconstruction

3D Self-Supervised Methods for Medical Imaging

Point2Skeleton: Learning Skeletal Representations from Point Clouds

PointSkelCNN: Deep Learning-Based 3D Human Skeleton Extraction from Point Clouds

IMMAT: Mesh Reconstruction from Single View Images by Medial Axis Transform Prediction

Unsupervised 3D Pose Estimation with Non-Rigid Structure-from-Motion Modeling

Weakly Supervised 3D Human Pose and Shape Reconstruction with Normalizing Flows