Abstract: Self-supervised video representation learning avoids heavy manual annotation by automatically mining supervisory signals from the data. Although contrastive learning based approaches exhibit superior performance, pretext task based approaches still deserve further study. Pretext tasks exploit the nature of the data and encourage feature extractors to learn spatiotemporal logic by discovering dependencies among video clips or cubes, without manual engineering of data augmentations or manual construction of contrastive pairs. To use the chronological property of video more effectively and efficiently, this work proposes a novel pretext task, named serial restoration of shuffled clips (SRSC), which is solved by a purpose-designed task network composed of an order-aware encoder and a serial restoration decoder. In contrast to other order based pretext tasks that formulate clip order recognition as a one-step classification problem, the proposed SRSC task restores shuffled clips to the correct order in multiple steps. Owing to the elasticity of SRSC, a novel taxonomy of curriculum learning is further proposed to equip SRSC with different pre-training strategies. According to the factors that affect the complexity of solving the SRSC task, the proposed curriculum learning strategies are categorized as task based, model based, and data based. Extensive experiments are conducted on the subdivided strategies to explore their effectiveness and to identify notable patterns. Compared with existing approaches, the proposed approach achieves state-of-the-art performance among pretext task based self-supervised video representation learning methods, and a majority of the proposed strategies further boost performance on downstream tasks. For the first time, features pre-trained by pretext tasks are applied to video captioning through feature-level early fusion, where they enhance the input of existing approaches as a lightweight plugin.
The source code of this work is available at https://mic.tongji.edu.cn.
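To make the contrast between one-step order classification and multi-step restoration concrete, the following is a minimal sketch of how training targets for the two formulations might be built. It is not the authors' implementation: the function names (`make_srsc_sample`, `restore`, `one_step_label`) are hypothetical, clips are stood in for by integer indices, and the order-aware encoder and serial restoration decoder are not modelled at all.

```python
import itertools
import random


def make_srsc_sample(num_clips, rng):
    """Shuffle clip indices and build per-step restoration targets.

    Assumption for illustration: at step t the decoder is asked to pick
    the shuffled slot that holds the clip belonging at chronological
    position t, so the targets form the inverse of the shuffle.
    """
    order = list(range(num_clips))
    rng.shuffle(order)  # order[j] = chronological index of the clip in slot j
    targets = [order.index(t) for t in range(num_clips)]
    return order, targets


def restore(shuffled, targets):
    """Apply the per-step selections to recover the chronological order."""
    return [shuffled[j] for j in targets]


def one_step_label(order):
    """One-step alternative: a single class index among all n! permutations."""
    perms = list(itertools.permutations(range(len(order))))
    return perms.index(tuple(order))


if __name__ == "__main__":
    rng = random.Random(0)
    order, targets = make_srsc_sample(4, rng)
    shuffled_clips = [f"clip{i}" for i in order]
    print("shuffled:", shuffled_clips)
    print("per-step targets:", targets)
    print("restored:", restore(shuffled_clips, targets))
    print("one-step class label:", one_step_label(order))
```

In this sketch the multi-step formulation needs one selection per clip, each over at most n slots, whereas the one-step label ranges over n! classes. That difference is one plausible reading of why the task complexity can be adjusted by changing the number of clips.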

Self-Supervised Video Representation Learning with Constrained Spatiotemporal Jigsaw

Self-supervised Spatiotemporal Representation Learning by Exploiting Video Continuity

Video Anomaly Detection by Solving Decoupled Spatio-Temporal Jigsaw Puzzles

Self-Supervised Representation Learning for Videos by Segmenting Via Sampling Rate Order Prediction

Self-supervised Representation Learning by Predicting Visual Permutations

Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics

Temporally-Embedded Self-Supervised Video Representation Learning

Self-Supervised Spatiotemporal Learning Via Video Clip Order Prediction

Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning

View Enhanced Jigsaw Puzzle for Self-Supervised Feature Learning in 3D Human Action Recognition

Learning Representations from Audio-Visual Spatial Alignment

Self-supervised Video Representation Learning by Serial Restoration with Elastic Complexity

Iterative Reorganization with Weak Spatial Constraints: Solving Arbitrary Jigsaw Puzzles for Unsupervised Representation Learning

Video 3D Sampling for Self-supervised Representation Learning

Self-Supervised Motion Perception for Spatiotemporal Representation Learning

Attentive spatial-temporal contrastive learning for self-supervised video representation

Explore Video Clip Order With Self-Supervised and Curriculum Learning for Video Applications

Spatio-Temporal Video Segmentation of Static Scenes and Its Applications

STC: Spatio-Temporal Contrastive Learning for Video Instance Segmentation

Self-Supervised Video Representation Learning with Motion-Contrastive Perception

Self-supervised Video Representation Learning by Context and Motion Decoupling