Abstract: Self-supervised representation learning for videos has attracted considerable attention recently because it exploits information inherent in the video itself rather than annotated labels, which are time-consuming to obtain. However, existing methods ignore the importance of global observation when perceiving spatio-temporal transformations, which severely limits the expressive power of the learned video representation. This paper proposes a novel pretext task that combines perception of the video's temporal information with perception of the motion amplitude of moving objects to learn a spatio-temporal representation of the video. Specifically, a video clip is divided into several segments, each segment is sampled at a different sampling rate, and the order of the segments is shuffled. The network is then trained to regress the sampling rate of each segment and to classify the order of the input segments. In the pre-training stage, the network learns rich spatio-temporal semantic information, and content-related contrastive learning is introduced to make the learned video representation more discriminative. To alleviate the appearance dependency caused by contrastive learning, we design a novel and robust vector similarity measure that takes feature alignment into account. Moreover, we propose a view synthesis framework that further improves contrastive learning by automatically generating reasonable transformed views. We conduct benchmark experiments with three 3D backbone networks on two datasets. The results show that the proposed method outperforms existing state-of-the-art methods across the three backbones on two downstream tasks: human action recognition and video retrieval.
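
The pretext task described in the abstract can be illustrated with a minimal sketch of how one training sample might be built. This is not the authors' implementation: the function name `make_pretext_sample`, the candidate rates, the segment count and length, and the choice to cut the video into equal, non-overlapping spans are all assumptions made for illustration. The sketch returns frame indices only; a real pipeline would decode those frames and feed them to a 3D backbone with a regression head for the rates and a classification head for the order.

```python
import itertools
import random


def make_pretext_sample(num_frames, num_segments=3, frames_per_segment=4,
                        rates=(1, 2, 4, 8), rng=None):
    """Build frame indices and labels for one pretext sample (illustrative).

    All default values are assumptions, not settings taken from the paper.
    """
    rng = rng or random.Random()
    # Every possible segment ordering is one class of the order classifier.
    perms = list(itertools.permutations(range(num_segments)))

    # Assumption: the video is cut into equal, non-overlapping spans.
    span = num_frames // num_segments
    if span < (frames_per_segment - 1) * max(rates) + 1:
        raise ValueError("video too short for the requested segments and rates")

    segments, seg_rates = [], []
    for s in range(num_segments):
        rate = rng.choice(rates)  # sampling rate (frame stride) of this segment
        start = s * span
        segments.append([start + i * rate for i in range(frames_per_segment)])
        seg_rates.append(rate)

    # Shuffle the segments; the permutation index is the classification label.
    order_label = rng.randrange(len(perms))
    perm = perms[order_label]
    shuffled = [segments[p] for p in perm]
    # Regression targets follow the shuffled order seen by the network.
    rate_targets = [float(seg_rates[p]) for p in perm]
    return shuffled, rate_targets, order_label
```

Calling `make_pretext_sample(96, rng=random.Random(0))` yields three lists of four frame indices each, one rate target per segment, and an order label in the range 0 to 5. Under these assumptions the order label indexes into all 3! permutations, so the number of order classes grows factorially with the number of segments.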

Learnable Query Contrast and Spatio-temporal Prediction on Point Cloud Video Pre-training

Point Contrastive Prediction with Semantic Clustering for Self-Supervised Learning on Point Cloud Videos

Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos

Knowledge-guided Pre-Training and Fine-Tuning: Video Representation Learning for Action Recognition

Contrastive Predictive Autoencoders for Dynamic Point Cloud Self-Supervised Learning

SimIPU: Simple 2D Image and 3D Point Cloud Unsupervised Pre-Training for Spatial-Aware Visual Representations

Masked Motion Prediction with Semantic Contrast for Point Cloud Sequence Learning

Attentive spatial-temporal contrastive learning for self-supervised video representation

Point-GCC: Universal Self-supervised 3D Scene Pre-training via Geometry-Color Contrast

Align Yourself: Self-supervised Pre-training for Fine-grained Recognition via Saliency Alignment

Self-supervised pretext task collaborative multi-view contrastive learning for video action recognition

Self-Supervised Representation Learning for Videos by Segmenting Via Sampling Rate Order Prediction

CrossVideo: Self-supervised Cross-modal Contrastive Learning for Point Cloud Video Understanding

PointVST: Self-Supervised Pre-training for 3D Point Clouds via View-Specific Point-to-Image Translation

CLIP2: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data

ProposalContrast: Unsupervised Pre-training for LiDAR-based 3D Object Detection

Patch-level Contrastive Learning Via Positional Query for Visual Pre-training

PointCG: Self-supervised Point Cloud Learning via Joint Completion and Generation

ECO-3D: Equivariant Contrastive Learning for Pre-training on Perturbed 3D Point Cloud

Contrastive Language-Action Pre-training for Temporal Localization

Point-Level Region Contrast for Object Detection Pre-Training