Self-Supervised Spatiotemporal Learning Via Video Clip Order Prediction.

Dejing Xu, Jun Xiao, Zhou Zhao, Jian Shao, Di Xie, Yueting Zhuang
DOI: https://doi.org/10.1109/cvpr.2019.01058
2019-01-01
Abstract: We propose a self-supervised spatiotemporal learning technique that leverages the chronological order of videos. Our method learns a spatiotemporal representation of a video by predicting the order of shuffled clips taken from it. Video category labels are not required, which gives the technique the potential to exploit virtually unlimited unannotated video. Related works operate on frames; compared to frames, clips are more consistent with the video dynamics, help reduce the uncertainty of orders, and are more appropriate for learning a video representation. 3D convolutional neural networks extract features from the clips, and these features are processed to predict the actual order. The learned representations are evaluated via nearest neighbor retrieval experiments. We also use the learned networks as pre-trained models and fine-tune them on the action recognition task. Three types of 3D convolutional neural networks are tested in experiments, and we obtain large improvements over existing self-supervised methods.
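The pretext task described in the abstract can be illustrated with a minimal sketch of how one training sample might be built: pick a few non-overlapping clips from a video, shuffle them, and use the index of the applied permutation as the classification target. The function name `make_order_prediction_sample` and the defaults (3 clips, 16 frames per clip, 8 frames between clips) are assumptions chosen for illustration and are not stated in the abstract. The 3D CNN feature extractor and the order classifier are omitted.

```python
import itertools
import random


def make_order_prediction_sample(num_frames, num_clips=3, clip_len=16,
                                 interval=8, rng=random):
    """Build one clip-order-prediction sample (hypothetical helper).

    Returns (shuffled_clips, label, perms), where each clip is a list of
    frame indices and `label` indexes the permutation in `perms` that was
    applied to the chronologically ordered clips.
    """
    # Total number of frames covered by all clips plus the gaps between them.
    span = num_clips * clip_len + (num_clips - 1) * interval
    if num_frames < span:
        raise ValueError("video too short for the requested clips")

    # Random starting frame so that every clip fits inside the video.
    start = rng.randrange(num_frames - span + 1)
    stride = clip_len + interval
    clips = [
        list(range(start + i * stride, start + i * stride + clip_len))
        for i in range(num_clips)
    ]

    # Every possible ordering is one class; num_clips! classes in total.
    perms = list(itertools.permutations(range(num_clips)))
    label = rng.randrange(len(perms))
    shuffled = [clips[i] for i in perms[label]]
    return shuffled, label, perms


if __name__ == "__main__":
    clips, label, perms = make_order_prediction_sample(300)
    print("applied order:", perms[label], "-> class", label)
    print("first frame of each shuffled clip:", [c[0] for c in clips])
```

In this sketch the number of classes grows factorially with the number of clips, so a small clip count keeps the classification problem tractable. A network trained on such samples would receive the shuffled clips and be asked to predict `label`.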