Abstract: Most existing self-supervised methods learn video representations from a single pretext task. Because a single pretext task provides only one form of supervision from unlabeled data, it may fail to capture the difference between spatial features and temporal features. When spatial and temporal features are similar, it becomes difficult to distinguish two similar videos that have different class labels. In this paper, we propose an attentive spatial–temporal contrastive learning network (ASTCNet), which learns self-attentive spatial–temporal features through contrastive learning across multiple spatial and temporal pretext tasks. The spatial features are learned by multiple spatial pretext tasks, including spatial rotation and spatial jigsaw. Each spatial feature is enhanced with spatial self-attention, which learns the relations between patches. The temporal features are learned by multiple temporal pretext tasks, including temporal order and temporal pace. Each temporal feature is enhanced with temporal self-attention, which learns the relations between frames, and is further enhanced by feeding optical flow features into a motion module. To separate the spatial and temporal features learned from one video, we represent the video with a different feature for each pretext task and design a pretext task-based contrastive loss. This loss encourages different pretext tasks to learn dissimilar features and the same pretext task to learn similar features, so that discriminative features are learned for each pretext task within one video. Experiments show that our method achieves state-of-the-art performance for self-supervised action recognition on the UCF101 and HMDB51 datasets.
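
The idea behind the pretext task-based contrastive loss can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the code below assumes a standard supervised-contrastive (InfoNCE-style) form in which features from the same pretext task act as positives and features from different pretext tasks act as negatives. The function names `cosine` and `pretext_contrastive_loss`, the `temperature` parameter, and the toy two-dimensional features are all hypothetical and are not taken from ASTCNet.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pretext_contrastive_loss(features, tasks, temperature=0.1):
    """Assumed InfoNCE-style loss over pretext-task labels (illustrative only).

    features: list of feature vectors, one per (clip, pretext task) pair.
    tasks:    list of pretext-task names, same length as `features`.
    Features sharing a task name are pulled together; all others are pushed apart.
    """
    n = len(features)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and tasks[j] == tasks[i]]
        if not positives:
            continue  # an anchor without a same-task partner contributes nothing
        # Temperature-scaled similarities between the anchor and every other feature.
        logits = {
            j: cosine(features[i], features[j]) / temperature
            for j in range(n)
            if j != i
        }
        log_denom = math.log(sum(math.exp(x) for x in logits.values()))
        # Average negative log-probability of picking a same-task feature.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


# Toy features for one video: two from a spatial task, two from a temporal task.
tasks = ["rotation", "rotation", "pace", "pace"]
separated = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
entangled = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

loss_separated = pretext_contrastive_loss(separated, tasks)
loss_entangled = pretext_contrastive_loss(entangled, tasks)
```

In this toy setting, `loss_separated` is lower than `loss_entangled`, because the first arrangement keeps same-task features close and different-task features apart, which is the behavior the abstract attributes to the loss.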
