Learning multi-view visual correspondences with self-supervision
Pengcheng Zhang, Lei Zhou, Xiao Bai, Chen Wang, Jun Zhou, Liang Zhang, Jin Zheng
DOI: https://doi.org/10.1016/j.displa.2022.102160
IF: 3.074
2022-04-01
Displays
Abstract: Stereo-based 3D reconstruction requires matching features across images captured from slightly different viewing angles to recover the 3D coordinates of image pixels. Beyond the workload of collecting data, annotating matched pixels also requires heavy labor. As recent research on self-supervised representation learning has made great progress, learning multi-view visual correspondences from large-scale raw videos serves as an alternative. However, existing methods that benefit from contrastive learning tend to neglect false negative samples when matching between adjacent frames in a video, leading to sub-optimal visual features. In this paper, we propose a contrastive learning framework that constructs self-supervision from semi-global visual correspondence to alleviate learning degradation when false negatives are involved in training. Our learning framework consists of pixel-level contrastive learning via patch reconstruction and patch-level contrastive learning across videos. We also introduce saliency guidance to extract salient regions from video frames to further reduce potential false negatives. By optimizing the model with the proposed semi-global contrastive learning method, the learned representations are forced to be discriminative and robust. Experiments demonstrate that our proposed method outperforms previous self-supervised methods on video object segmentation tasks. Moreover, when compared to fully-supervised algorithms designed for specific tasks, our proposed method also achieves competitive results.
engineering, electrical & electronic; instruments & instrumentation; optics; computer science, hardware & architecture
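
The abstract's central concern, false negatives degrading a contrastive objective, can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a generic InfoNCE loss in NumPy, and the function name `info_nce`, the `false_negative_mask` argument, and the temperature value are assumptions made for illustration only. The sketch shows how excluding a suspected false negative (a key that actually matches the query) from the denominator lowers the loss that would otherwise push matching features apart.

```python
import numpy as np


def info_nce(query, keys, positive_idx, false_negative_mask=None, temperature=0.07):
    """Generic InfoNCE loss for one query feature against n key features.

    query: array of shape (d,)
    keys: array of shape (n, d)
    positive_idx: index of the true positive key
    false_negative_mask: optional bool array of shape (n,); True marks keys
        suspected to be false negatives, which are dropped from the denominator
        (hypothetical argument, not taken from the paper)
    """
    query = np.asarray(query, dtype=float)
    keys = np.asarray(keys, dtype=float)

    # Cosine similarity between the query and every key, scaled by temperature.
    q = query / np.linalg.norm(query)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = (k @ q) / temperature

    # Keep every key by default; drop suspected false negatives, never the positive.
    keep = np.ones(len(keys), dtype=bool)
    if false_negative_mask is not None:
        keep &= ~np.asarray(false_negative_mask, dtype=bool)
    keep[positive_idx] = True

    # Dropped keys get -inf so they contribute exp(-inf) = 0 to the sum.
    masked = np.where(keep, logits, -np.inf)
    m = masked.max()
    log_denominator = m + np.log(np.exp(masked - m).sum())
    return float(log_denominator - masked[positive_idx])


# Toy example: key 1 is nearly identical to the positive key 0.
query = np.array([1.0, 0.0])
keys = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])

loss_plain = info_nce(query, keys, positive_idx=0)
loss_masked = info_nce(query, keys, positive_idx=0,
                       false_negative_mask=[False, True, False])
```

In this toy setup the near-duplicate key inflates the unmasked loss even though the query already matches its positive perfectly; masking it removes that penalty. How false negatives are actually identified (the abstract mentions semi-global correspondence and saliency guidance) is outside the scope of this sketch.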