Temporally-Embedded Self-Supervised Video Representation Learning

Feng Wang, Huaping Liu
DOI: https://doi.org/10.1109/icus52573.2021.9641272
2021-01-01
Abstract: Spatial appearance and temporal relation are two intrinsic properties of videos. 3D CNNs are natural architectures for capturing the spatio-temporal information in videos. However, learning spatio-temporal video representations successfully usually requires massive amounts of manually labeled videos, which makes it impractical to scale. In this paper, we propose a general framework to learn spatio-temporal video representations in a self-supervised way. The proposed framework consists of two stages organized in a progressive learning manner. In the first stage, we train our 3D ConvNets to capture semantic spatial information by transferring successful pretext tasks designed for image representations to the video domain. In the second stage, we embed temporal knowledge into the learned network by classifying a synthetic motion dataset whose labels can be generated automatically. We conduct experiments to validate the effectiveness of the proposed framework. When fine-tuned on action recognition benchmarks, our pre-trained model brings a remarkable gain of 26.3% on UCF101 and 20.6% on HMDB51 compared with models trained from scratch, outperforming current state-of-the-art methods.
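To make the second stage more concrete, the following is a minimal sketch of how a synthetic motion dataset with automatically generated labels could be built. It is not the authors' generator: the abstract does not describe the motion classes, clip sizes, or rendering, so the four direction classes, the moving bright square, and the names DIRECTIONS, make_synthetic_clip, and make_dataset are all assumptions chosen for illustration. The point is only that the label is known by construction, so no manual annotation is needed.

```python
import random

# Hypothetical label set (an assumption; the abstract does not list motion classes).
# Each label maps to a per-frame displacement (dy, dx): right, left, down, up.
DIRECTIONS = {0: (0, 1), 1: (0, -1), 2: (1, 0), 3: (-1, 0)}


def make_synthetic_clip(label, num_frames=8, size=16, patch=4, speed=1, rng=None):
    """Return a clip as frames[t][y][x]: a bright square moving in the labeled direction."""
    rng = rng or random.Random()
    dy, dx = DIRECTIONS[label]
    # Total distance travelled over the clip, used to keep the square inside the frame.
    travel = speed * (num_frames - 1)
    lo_y = travel if dy < 0 else 0
    hi_y = size - patch - (travel if dy > 0 else 0)
    lo_x = travel if dx < 0 else 0
    hi_x = size - patch - (travel if dx > 0 else 0)
    y0 = rng.randint(lo_y, hi_y)
    x0 = rng.randint(lo_x, hi_x)

    clip = []
    for t in range(num_frames):
        frame = [[0.0] * size for _ in range(size)]
        top = y0 + dy * speed * t
        left = x0 + dx * speed * t
        for y in range(top, top + patch):
            for x in range(left, left + patch):
                frame[y][x] = 1.0
        clip.append(frame)
    return clip


def make_dataset(num_clips, seed=0):
    """Return a list of (clip, label) pairs; the label comes from the generator itself."""
    rng = random.Random(seed)
    data = []
    for _ in range(num_clips):
        label = rng.randrange(len(DIRECTIONS))
        data.append((make_synthetic_clip(label, rng=rng), label))
    return data
```

Under these assumptions, each (clip, label) pair could be fed to a 3D ConvNet as an ordinary classification example, for instance with a cross-entropy loss over the four direction classes. A generator closer to the paper would presumably use richer appearance and motion patterns.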