Abstract: Self-supervised representation learning has proven effective for skeleton-based action recognition. To improve performance, existing methods mainly focus on (1) multi-modal data augmentations and (2) the construction of triplet contrastive samples. However, designing these strategies is heuristic and difficult. Instead of exploring more strategies of the same kind, this paper approaches the problem from a different angle and proposes a novel Contrastive Spatio-Temporal Clustering (CSTC) module. CSTC constructs a supervisory signal (pseudo-label) for action sequences in an online clustering manner, and it is complementary to recent data augmentation and triplet contrastive sample construction strategies. Specifically, CSTC can be formulated as an optimal transport problem. We introduce spatio-temporal regularizations into the original optimal transport term to guide pseudo-label generation: a semantic regularization learned from the frame index constrains the frame order, and a prior normal distribution regularization based on the sampling characteristics of the samples maintains the reliability of spatial cluster assignments. Furthermore, to enhance the learning of latent features, we propose a Bidirectional Cross-modal Clustering Consistency Objective (B3CO) that enforces consistent cluster assignments across different modalities of the same sample. Finally, since fusing the spatial and temporal clustering losses directly during back-propagation confuses the learned dimension-specific semantics, we propose a simple yet effective training strategy that trains the model with these two losses alternately. By integrating the above designs into the MoCo framework, we obtain a Contrastive Spatio-Temporal Clustering Network (CSTCN), which can extract cross-modal discriminative spatio-temporal features in the clustering space.
Experimental results on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD II datasets show that CSTCN achieves state-of-the-art performance with both single- and multi-modal models, especially under the KNN and semi-supervised evaluation protocols. In addition, the key module CSTC generalizes well: it yields consistent performance improvements when added to several state-of-the-art methods that focus on data augmentations and triplet contrastive sample construction.
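To make the optimal transport view of online clustering more concrete, the following is a minimal sketch of Sinkhorn-Knopp pseudo-label generation under an equal-partition constraint, the generic entropic optimal transport procedure used in clustering-based self-supervised learning. It is an illustration only and not the CSTC formulation: the spatio-temporal regularizations described in the abstract are not reproduced, and the function name `sinkhorn_pseudo_labels`, the parameters `epsilon` and `n_iters`, and the toy scores are assumptions made for this example.

```python
import math


def sinkhorn_pseudo_labels(scores, epsilon=0.05, n_iters=50):
    """Turn sample-to-prototype scores into balanced soft pseudo-labels.

    scores: list of N rows, each holding K similarity scores (hypothetical input).
    Returns an N x K matrix whose rows sum to 1 and whose columns each
    sum to approximately N / K (equal-partition constraint).
    """
    n, k = len(scores), len(scores[0])
    # Subtract the global maximum before exponentiating for numerical stability.
    top = max(max(row) for row in scores)
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: every cluster receives the same total mass N / K.
        for j in range(k):
            col_sum = sum(q[i][j] for i in range(n))
            for i in range(n):
                q[i][j] *= (n / k) / col_sum
        # Row step: every sample distributes exactly one unit of mass.
        for i in range(n):
            row_sum = sum(q[i])
            q[i] = [v / row_sum for v in q[i]]
    return q


if __name__ == "__main__":
    # Toy scores: a plain argmax would put three of four samples in cluster 0.
    toy_scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8]]
    for row in sinkhorn_pseudo_labels(toy_scores, epsilon=0.5, n_iters=100):
        print([round(v, 3) for v in row])
```

The balancing step is what prevents the trivial solution in which all sequences collapse into a single cluster. A regularized variant, such as the one the abstract describes, would add further terms to the transport objective before this normalization.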

See Your Emotion from Gait Using Unlabeled Skeleton Data.

Self-supervised Gait-based Emotion Representation Learning from Selective Strongly Augmented Skeleton Sequences

A Multi-Head Pseudo Nodes Based Spatial–temporal Graph Convolutional Network for Emotion Perception from GAIT

G-GCSN: Global Graph Convolution Shrinkage Network for Emotion Perception from Gait

Take an Emotion Walk: Perceiving Emotions from Gaits Using Hierarchical Attention Pooling and Affective Mapping

Human Gait Recognition Based on Self-Adaptive Hidden Markov Model

Skeleton-Contrastive 3D Action Representation Learning

Condition-Adaptive Graph Convolution Learning for Skeleton-Based Gait Recognition

SkeletonGait: Gait Recognition Using Skeleton Maps

Looking into Gait for Perceiving Emotions via Bilateral Posture and Movement Graph Convolutional Networks

ST-Gait++: Leveraging spatio-temporal convolutions for gait-based emotion recognition on videos

Augmented Skeleton Based Contrastive Action Learning with Momentum LSTM for Unsupervised Action Recognition

Contrastive Learning from Extremely Augmented Skeleton Sequences for Self-Supervised Action Recognition

Condition-Aware Comparison Scheme for Gait Recognition

MetaGait: Learning to Learn an Omni Sample Adaptive Representation for Gait Recognition

Learning Representations by Contrastive Spatio-temporal Clustering for Skeleton-based Action Recognition

GaitSCM: Causal representation learning for gait recognition

Learning Rich Features for Gait Recognition by Integrating Skeletons and Silhouettes

Contrastive Learning from Spatio-Temporal Mixed Skeleton Sequences for Self-Supervised Skeleton-Based Action Recognition

On Learning Disentangled Representations for Gait Recognition

Hierarchical-Attention-Based Neural Network for Gait Emotion Recognition