Explore Video Clip Order With Self-Supervised and Curriculum Learning for Video Applications

Jun Xiao, Lin Li, Dejing Xu, Chengjiang Long, Jian Shao, Shifeng Zhang, Shiliang Pu, Yueting Zhuang
DOI: https://doi.org/10.1109/tmm.2020.3025661
IF: 7.3
2021-01-01
IEEE Transactions on Multimedia
Abstract: We present a self-supervised spatiotemporal learning approach that exploits the temporal coherence of videos. The chronological order of shuffled clips from a video serves as the supervisory signal that guides 3D Convolutional Neural Networks (CNNs) to learn meaningful visual knowledge. Unlike existing approaches that use frames, we use dynamic video clips to reduce the uncertainty of order. We test three representative types of 3D CNNs, all of which benefit from the proposed approach. The learned 3D CNNs can be used either as feature extractors or as pre-trained models for further fine-tuning on downstream tasks. We also propose two curriculum learning strategies that make the 3D CNNs easier to train and achieve state-of-the-art results in nearest neighbor retrieval and action recognition compared with other self-supervised learning methods. We further extend the approach to visual question answering, where it achieves promising results. Finally, we provide extensive experimental results and analyses to help readers better understand clip order prediction with self-supervised and curriculum learning for video applications.
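
The clip-order supervisory signal described in the abstract can be illustrated with a minimal sketch. It only shows how a training sample might be built from frame indices: a few non-overlapping clips are taken from one video, shuffled, and the index of the applied permutation becomes the classification label. The function name and all parameter values (three clips, 16 frames per clip, an 8-frame gap) are assumptions made for illustration and are not stated in the abstract; the 3D CNN and the curriculum strategies are not shown.

```python
import itertools
import random


def make_clip_order_sample(num_frames, num_clips=3, clip_len=16, gap=8, rng=None):
    """Build one clip-order sample from frame indices (hypothetical helper).

    Returns (shuffled_clips, label, order), where order[j] is the chronological
    position of the clip placed at slot j, and label is the index of that
    permutation among all num_clips! permutations.
    """
    rng = rng or random.Random()
    # Total frames covered by the clips plus the gaps between them.
    span = num_clips * clip_len + (num_clips - 1) * gap
    if num_frames < span:
        raise ValueError("video is too short for the requested clips")

    # Pick a random starting frame, then cut clips in chronological order.
    start = rng.randrange(num_frames - span + 1)
    clips = [
        list(range(start + i * (clip_len + gap),
                   start + i * (clip_len + gap) + clip_len))
        for i in range(num_clips)
    ]

    # Choose one permutation; its index is the label a model would predict.
    perms = list(itertools.permutations(range(num_clips)))
    label = rng.randrange(len(perms))
    order = perms[label]
    shuffled_clips = [clips[i] for i in order]
    return shuffled_clips, label, order


if __name__ == "__main__":
    shuffled, label, order = make_clip_order_sample(100, rng=random.Random(0))
    print("permutation:", order, "label:", label)
    print("first frame of each shuffled clip:", [c[0] for c in shuffled])
```

In this sketch, a network would receive the shuffled clips and be trained to predict the label, so no manual annotation is needed. With three clips there are only six classes, which keeps the classification problem small.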