Abstract: Self-supervised video representation learning has attracted considerable attention because it exploits information inherent in the video itself rather than annotated labels, which are time-consuming to obtain. However, existing methods ignore the importance of global observation when perceiving spatio-temporal transformations, which severely limits the expressive capability of the learned video representation. This paper proposes a novel pretext task that combines perception of the video's temporal information with perception of the motion amplitude of moving objects to learn a spatio-temporal video representation. Specifically, given a video clip containing several video segments, the segments are sampled at different sampling rates and their order is shuffled. The network is then trained to regress the sampling rate of each segment and to classify the order of the input segments. In the pre-training stage, the network learns rich spatio-temporal semantic information, and content-related contrastive learning is introduced to make the learned representation more discriminative. To alleviate the appearance dependency caused by contrastive learning, we design a novel and robust vector similarity measure that takes feature alignment into account. Moreover, a view synthesis framework is proposed to further improve contrastive learning by automatically generating reasonable transformed views. We conduct benchmark experiments with three 3D backbone networks on two datasets. The results show that the proposed method outperforms existing state-of-the-art methods across the three backbones on two downstream tasks: human action recognition and video retrieval.
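As a rough illustration of the pretext task described in the abstract, the following sketch builds the frame indices and labels for one training sample. It is only a sketch under stated assumptions, not the authors' implementation: the function name `make_pretext_sample`, the number of segments, the frames per segment, the candidate rates `(1, 2, 4)`, the independent draw of each segment's rate, and the encoding of the order as an index into the list of permutations are all hypothetical choices that the abstract does not specify.

```python
import itertools
import random


def make_pretext_sample(num_frames, num_segments=3, frames_per_segment=4,
                        rates=(1, 2, 4), rng=None):
    """Return (frame_indices, rate_targets, order_label) for one clip.

    All default values are illustrative assumptions, not values from the paper.
    """
    rng = rng or random.Random()
    seg_len = num_frames // num_segments  # raw frames available per segment

    segments = []
    for s in range(num_segments):
        start = s * seg_len
        # Keep only rates whose sampled span fits inside one segment.
        feasible = [r for r in rates if (frames_per_segment - 1) * r < seg_len]
        if not feasible:
            raise ValueError("segment too short for the requested rates")
        rate = rng.choice(feasible)  # assumption: rates drawn independently
        span = (frames_per_segment - 1) * rate + 1
        offset = rng.randrange(seg_len - span + 1)
        indices = [start + offset + i * rate for i in range(frames_per_segment)]
        segments.append((indices, rate))

    # Shuffle the segment order; the class label is the permutation's index.
    perms = list(itertools.permutations(range(num_segments)))
    order_label = rng.randrange(len(perms))
    shuffled = [segments[i] for i in perms[order_label]]

    frame_indices = [indices for indices, _ in shuffled]
    rate_targets = [float(rate) for _, rate in shuffled]  # regression targets
    return frame_indices, rate_targets, order_label


# Usage: a 48-frame clip split into 3 segments of 4 sampled frames each.
frame_indices, rate_targets, order_label = make_pretext_sample(
    48, rng=random.Random(0))
```

In such a setup, a network would receive the frames selected by `frame_indices`, regress `rate_targets` per segment, and classify `order_label` over the `num_segments!` possible orderings; how the two losses are combined is left open here because the abstract does not say.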

Shuffle and learn: unsupervised learning using temporal order verification

Unsupervised learning using sequential verification for action recognition

Self-Supervised Spatiotemporal Learning Via Video Clip Order Prediction

Explore Video Clip Order With Self-Supervised and Curriculum Learning for Video Applications

Skip-Clip: Self-Supervised Spatiotemporal Representation Learning by Future Clip Order Ranking

Self-supervised Representation Learning by Predicting Visual Permutations

Self-Supervised Visual Learning by Variable Playback Speeds Prediction of a Video

Self-Supervised Representation Learning for Videos by Segmenting Via Sampling Rate Order Prediction

DistInit: Learning Video Representations Without a Single Labeled Video

Learning Features by Watching Objects Move

Temporally-Embedded Self-Supervised Video Representation Learning

Unsupervised Learning of Video Representations using LSTMs

Made to Order: Discovering monotonic temporal changes via self-supervised video ordering

Learning Transferable Spatiotemporal Representations from Natural Script Knowledge

Can Shuffling Video Benefit Temporal Bias Problem: A Novel Training Framework for Temporal Grounding

Unsupervised Learning of Long-Term Motion Dynamics for Videos

Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics

High Order Neural Networks for Video Classification

Learning to Sort Image Sequences via Accumulated Temporal Differences

Action Shuffling for Weakly Supervised Temporal Localization

Order-Constrained Representation Learning for Instructional Video Prediction