Dynamic-boosting Attention for Self-Supervised Video Representation Learning

Wang Zhipeng, Hou Chunping, Yue Guanghui, Yang Qingyuan
DOI: https://doi.org/10.1007/s10489-021-02440-0
IF: 5.3
2021-01-01
Applied Intelligence
Abstract: Self-supervised video representation learning leverages supervisory signals from the data itself to obtain scalable video representations for downstream tasks, e.g., action recognition. Previous methods mainly leverage temporal signals to learn the temporal relationships between video frames. However, these methods learn only weak semantic information about videos because they lack semantic labels. Moreover, they cannot train models sufficiently because of interference from meaningless frames. To tackle these problems, this paper proposes a novel self-supervised video representation learning method that guides the network to learn compact and effective semantic information and temporal relationships of videos. Specifically, we introduce the video clip order prediction (VCOP) pretext task to learn the temporal relationships of video frames. Building on VCOP, we further propose a Dynamic-Boosting Attention (DBA) module to mine video semantic information and softly select key frames. DBA performs a dynamic boosting scheme to extract semantic information from the high-level video features and uses this information to softly select the low-level key-frame features. We train 3D CNNs with our method and apply the learned model as the pretrained model on two downstream tasks. Experimental results demonstrate that our DBA method can increase the training efficiency of self-supervised learning. Notably, our 3D CNN model learns rich semantic knowledge and achieves clear improvements on downstream tasks.
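The two ideas named in the abstract, clip order prediction and soft key-frame selection, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `make_vcop_sample` and `soft_select`, the default clip count, clip length, and frame interval, and the use of a plain softmax over per-frame scores are all assumptions made for illustration. The abstract does not specify how DBA computes its weights, so `soft_select` shows only generic softmax-weighted pooling.

```python
import itertools
import math
import random


def make_vcop_sample(num_frames, num_clips=3, clip_len=16, interval=8, rng=random):
    """Build one clip-order-prediction sample (hypothetical helper).

    Returns the shuffled clips (lists of frame indices) and the index of the
    permutation that was applied, which serves as the classification label.
    """
    span = num_clips * clip_len + (num_clips - 1) * interval
    if num_frames < span:
        raise ValueError("video too short for the requested clips")
    start = rng.randrange(num_frames - span + 1)

    # Non-overlapping clips in their true temporal order.
    stride = clip_len + interval
    clips = [
        list(range(start + i * stride, start + i * stride + clip_len))
        for i in range(num_clips)
    ]

    # Each of the num_clips! orderings is one class; pick one at random.
    perms = list(itertools.permutations(range(num_clips)))
    label = rng.randrange(len(perms))
    shuffled = [clips[i] for i in perms[label]]
    return shuffled, label


def soft_select(frame_features, scores):
    """Softmax-weighted pooling of per-frame feature vectors (assumed form).

    Frames with higher scores contribute more, but no frame is discarded.
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frame_features[0])
    pooled = [
        sum(w * f[d] for w, f in zip(weights, frame_features))
        for d in range(dim)
    ]
    return pooled, weights
```

In a training loop under these assumptions, the shuffled clips would be encoded by a 3D CNN and a classifier would predict `label`; the scores passed to `soft_select` would come from whatever attention mechanism is used.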