Self-Supervised Video Representation Learning Using Improved Instance-wise Contrastive Learning and Deep Clustering

Yisheng Zhu, Hui Shuai, Guangcan Liu, Qingshan Liu
DOI: https://doi.org/10.1109/tcsvt.2022.3169469
IF: 5.859
2022-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Instance-wise contrastive learning (Instance-CL), which learns to map similar instances closer together and different instances farther apart in the embedding space, has achieved considerable progress in self-supervised video representation learning. However, canonical Instance-CL does not properly handle the temporal similarities between different videos, which limits the representation capability of the learned models. This paper presents a novel two-stage framework that combines Instance-CL and unsupervised clustering to progressively learn desirable temporal representations with high intra-class compactness. Specifically, (a) we first introduce a new consistency-preserving sampling strategy to generate positive/negative pairs. Compared to traditional sampling methods, our sampling strategy focuses more on motion dynamics, resulting in more temporally relevant feature representations. (b) To further explore the temporal similarities between videos and thereby encourage intra-class compactness, we use the temporal representations extracted from Instance-CL as an initializer and iteratively apply k-means clustering to generate pseudo-labels for training the encoder. We term our method Improved Instance-CL with Deep Clustering (ICDC) and apply it to two downstream tasks: action recognition and video retrieval. Extensive experimental results show that ICDC achieves considerable improvements over existing self-supervised methods.
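The two stages described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the loss form, the consistency-preserving sampling strategy, or the clustering details, so the sketch assumes a standard InfoNCE loss for the Instance-CL stage and plain Lloyd's k-means with farthest-point initialization for the pseudo-label stage. The function names `info_nce_loss` and `kmeans_pseudo_labels`, the temperature value, and the initialization scheme are all hypothetical choices made for illustration. Training the encoder on the pseudo-labels is omitted.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def info_nce_loss(anchors, positives, temperature=0.1):
    """Assumed stage-1 objective: standard InfoNCE over a batch.

    Row i of `positives` is the positive for row i of `anchors`;
    every other row in the batch acts as a negative.
    """
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = a @ p.T / temperature                       # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def kmeans_pseudo_labels(features, k, iters=20):
    """Assumed stage-2 step: k-means on embeddings, returning pseudo-labels.

    Farthest-point initialization is an illustrative choice that keeps
    the sketch deterministic; the abstract does not specify one.
    """
    x = l2_normalize(features)
    centers = [x[0]]
    for _ in range(1, k):
        # Distance from each point to its nearest already-chosen center.
        d = np.min(
            [((x - c) ** 2).sum(axis=1) for c in centers], axis=0
        )
        centers.append(x[d.argmax()])
    centers = np.stack(centers)

    for _ in range(iters):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:          # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)

    # Final assignment against the last updated centers.
    dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)


# Synthetic stand-ins for clip embeddings: two tight groups of 8 vectors.
rng = np.random.default_rng(0)
group_a = np.array([1.0, 0.0, 0.0, 0.0]) + 0.01 * rng.normal(size=(8, 4))
group_b = np.array([0.0, 1.0, 0.0, 0.0]) + 0.01 * rng.normal(size=(8, 4))
embeddings = np.concatenate([group_a, group_b])

pseudo_labels = kmeans_pseudo_labels(embeddings, k=2)
```

In a full pipeline, the pseudo-labels would serve as classification targets for the encoder, and clustering would be repeated on the updated embeddings, matching the iterative procedure the abstract describes.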