Video Representation Learning with Graph Contrastive Augmentation

Jingran Zhang, Xing Xu, Fumin Shen, Yazhou Yao, Jie Shao, Xiaofeng Zhu
DOI: https://doi.org/10.1145/3474085.3475510
2021-01-01
Abstract: Contrastive self-supervised learning for image representations has significantly narrowed the gap with supervised learning. A natural way to extend image-based contrastive methods to the video domain is to fully exploit the temporal structure present in videos. We propose a novel contrastive self-supervised video representation learning framework, termed Graph Contrastive Augmentation (GCA), which constructs a video temporal graph and devises a graph augmentation designed to enhance the correlation across video frames and to provide a new view for exploring temporal structure. Specifically, we construct the temporal graph by leveraging the relational knowledge underlying the correlated sequence of video features. We then apply the proposed graph augmentation to generate another graph view by incorporating random corruption of the original graph, which increases the diversity of the temporal graph's intrinsic structure. We further provide two kinds of contrastive learning methods to train our framework, using the temporal relationships concealed in videos as self-supervised signals. We evaluate the learned video representations on two downstream tasks, action recognition and video retrieval, and the results demonstrate that, with the graph view of temporal structure, GCA outperforms or performs on par with recent methods.
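The pipeline outlined in the abstract (temporal graph, corrupted graph view, contrastive objective) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how edges are built, how the graph is corrupted, or which contrastive losses are used, so the cosine-similarity threshold, the edge-dropping corruption, the mean-aggregation readout, the InfoNCE-style loss, and every function and parameter name below are assumptions chosen purely for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def build_temporal_graph(frame_feats, threshold=0.5):
    """Assumed construction: link frames whose similarity exceeds a threshold."""
    n = len(frame_feats)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine(frame_feats[i], frame_feats[j])
            if sim > threshold:
                adj[i][j] = adj[j][i] = sim  # undirected, similarity-weighted edge
    return adj


def corrupt_graph(adj, drop_prob=0.2, rng=None):
    """Assumed augmentation: drop each existing edge with probability drop_prob."""
    rng = rng or random.Random()
    n = len(adj)
    out = [row[:] for row in adj]
    for i in range(n):
        for j in range(i + 1, n):
            if out[i][j] != 0.0 and rng.random() < drop_prob:
                out[i][j] = out[j][i] = 0.0
    return out


def graph_readout(adj, frame_feats):
    """One weighted neighbourhood-averaging step (with self-loops), then mean pooling."""
    n, dim = len(adj), len(frame_feats[0])
    pooled = [0.0] * dim
    for i in range(n):
        weights = adj[i][:]
        weights[i] = 1.0  # self-loop keeps the frame's own feature
        total = sum(weights)
        for k in range(dim):
            agg = sum(weights[j] * frame_feats[j][k] for j in range(n)) / total
            pooled[k] += agg / n
    return pooled


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: the positive pair should score above all negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]
```

In this sketch, the readout of the original graph and the readout of its corrupted copy would form a positive pair for one video, while readouts from other videos would serve as negatives.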