Abstract: Self-supervised representation learning has proven effective for skeleton-based action recognition. To improve performance, existing methods mainly focus on (1) multi-modal data augmentations and (2) the construction of triplet contrastive samples. However, designing these strategies is heuristic and difficult. Rather than exploring more strategies of this kind, this paper approaches the problem from a different angle and proposes a novel Contrastive Spatio-Temporal Clustering (CSTC) module. CSTC constructs a supervisory signal (pseudo-labels) for action sequences in an online clustering manner, and it is complementary to recent data augmentation and triplet contrastive sample construction strategies. Specifically, CSTC is formulated as an optimal transport problem. We introduce spatio-temporal regularizations into the original optimal transport term to guide pseudo-label generation: a semantic regularization learned from the frame index constrains the frame order, and a prior normal distribution regularization based on the sampling characteristics of the samples maintains the reliability of the spatial cluster assignments. Furthermore, to enhance the learning of latent features, we propose a Bidirectional Cross-modal Clustering Consistency Objective (B3CO) that enforces consistent cluster assignments across different modalities of the same sample. Finally, because directly fusing the spatial and temporal clustering losses during back-propagation confuses the learned dimension-specific semantics, we propose a simple yet effective training strategy that trains the model with these two losses alternately. By integrating the above designs into the MoCo framework, we propose a Contrastive Spatio-Temporal Clustering Network (CSTCN), which extracts cross-modal discriminative spatio-temporal features in the clustering space.
Experimental results on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD II datasets show that CSTCN achieves state-of-the-art performance in both single- and multi-modal settings, especially under the KNN and semi-supervised evaluation protocols. In addition, the key module CSTC generalizes well: it yields consistent performance improvements when added to several state-of-the-art methods that focus on data augmentations and triplet contrastive sample construction.
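
The abstract casts online pseudo-label generation as an optimal transport problem. As a rough illustration only, the sketch below runs the standard entropic-regularized Sinkhorn-Knopp iteration with uniform marginals, a common way to obtain balanced cluster assignments from a score matrix. It does not include the spatio-temporal regularizations described above, and the function name `sinkhorn_assign`, the parameters `epsilon` and `n_iters`, and the toy scores are assumptions made for this example, not the authors' implementation.

```python
import math


def sinkhorn_assign(scores, epsilon=0.05, n_iters=50):
    """Convert a samples-by-clusters score matrix into soft pseudo-labels.

    Plain entropic optimal transport with uniform marginals
    (Sinkhorn-Knopp). Hypothetical helper: no spatio-temporal
    regularization terms are modeled here.
    """
    n, k = len(scores), len(scores[0])
    # Subtract the global maximum before exponentiating for stability.
    top = max(max(row) for row in scores)
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: every cluster receives total mass 1/k.
        col = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / (col[j] * k) for j in range(k)] for i in range(n)]
        # Row step: every sample carries total mass 1/n.
        row = [sum(r) for r in q]
        q = [[q[i][j] / (row[i] * n) for j in range(k)] for i in range(n)]
    # Rescale so that each sample's assignment sums to 1.
    return [[v * n for v in r] for r in q]


if __name__ == "__main__":
    # Toy scores (assumed): every sample prefers cluster 0.
    toy = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
    for soft in sinkhorn_assign(toy):
        print([round(p, 3) for p in soft])
```

In the toy input every sample scores cluster 0 highest, yet the equal-mass constraint on the columns pushes roughly half of the assignment mass toward cluster 1. This balancing behavior is the reason such formulations are commonly used to produce pseudo-labels that do not collapse onto a single cluster.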

Unsupervised Video Action Clustering Via Motion-Scene Interaction Constraint

Point Contrastive Prediction with Semantic Clustering for Self-Supervised Learning on Point Cloud Videos

Unsupervised Sports Video Scene Clustering And Its Applications To Story Units Detection

Unsupervised Categorization of Human Motion Sequences

Exploiting Unsupervised and Supervised Constraints for Subspace Clustering

On Model-Based Clustering of Video Scenes Using Scenelets.

Human Action Recognition with Contextual Constraints Using a RGB-D Sensor

Mejigclu: more effective jigsaw clustering for unsupervised visual representation learning

A Matrix-Based Approach to Unsupervised Human Action Categorization

Unsupervised Action Segmentation by Joint Representation Learning and Online Clustering

An Unsupervised Approach To Dominant Video Scene Clustering

Deep video action clustering via spatio-temporal feature learning

A Joint Matrix Factorization Approach to Unsupervised Action Categorization

Multi-task Information Bottleneck Co-clustering for Unsupervised Cross-view Human Action Categorization

Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion

Unsupervised Person Clustering in Videos with Cross-Modal Communication.

Self-supervised pretext task collaborative multi-view contrastive learning for video action recognition

Subspace-Contrastive Multi-View Clustering

Learning Representations by Contrastive Spatio-temporal Clustering for Skeleton-based Action Recognition

Transform-Invariant Non-Parametric Clustering of Covariance Matrices and its Application to Unsupervised Joint Segmentation and Action Discovery

Revisiting Foreground and Background Separation in Weakly-supervised Temporal Action Localization: A Clustering-based Approach