Pixel-level Correspondence for Self-Supervised Learning from Video

Yash Sharma, Yi Zhu, Chris Russell, Thomas Brox
DOI: https://doi.org/10.48550/arXiv.2207.03866
2022-07-08
Abstract: While self-supervised learning has enabled effective representation learning in the absence of labels, for vision, video remains a relatively untapped source of supervision. To address this, we propose Pixel-level Correspondence (PiCo), a method for dense contrastive learning from video. By tracking points with optical flow, we obtain a correspondence map which can be used to match local features at different points in time. We validate PiCo on standard benchmarks, outperforming self-supervised baselines on multiple dense prediction tasks, without compromising performance on image classification.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper asks how pixel-level correspondence in video can be used for self-supervised learning so as to obtain more effective dense representations. The authors propose PiCo (Pixel-level Correspondence), which performs dense contrastive learning by tracking points through a video and matching the local features found at those points at different times. The aim is to overcome the limitations of methods trained on static images and to exploit the spatio-temporal variation inherent in video data.

The main contributions of the paper are:

1. **Introducing pixel-level correspondence**: An off-the-shelf optical flow estimator tracks points through a video. The resulting tracks form a correspondence map that is used to match local features at different points in time.
2. **Improving downstream performance**: PiCo is validated on multiple benchmarks and tasks. On dense prediction tasks (such as semantic segmentation and object detection) it outperforms existing self-supervised baselines.
3. **Maximizing temporal separation and trajectory density**: An anchor sampling strategy selects frames that maximize both the temporal separation between them and the density of trajectories connecting them, which further improves learning.

With these components, PiCo achieves better performance on dense prediction tasks without compromising performance on global prediction tasks such as image classification.
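The idea of tracking points with optical flow and contrasting the features at matched locations can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the function names `track_points` and `dense_info_nce` are hypothetical, flow is looked up with nearest-neighbour rounding, and an InfoNCE-style loss stands in for whatever dense contrastive objective the paper uses.

```python
import numpy as np

def track_points(points, flows):
    """Chain per-frame optical flow to carry points from the first frame to the last.

    points: (N, 2) array of (x, y) pixel coordinates in the first frame.
    flows:  list of (H, W, 2) arrays; flows[t][y, x] is the (dx, dy)
            displacement from frame t to frame t + 1 (assumed convention).
    Returns the final positions (N, 2) and a mask of points that never left the image.
    """
    pts = points.astype(np.float64).copy()
    alive = np.ones(len(pts), dtype=bool)
    for flow in flows:
        h, w = flow.shape[:2]
        # Nearest-neighbour lookup of the flow vector at each current position.
        xi = np.clip(np.rint(pts[:, 0]).astype(int), 0, w - 1)
        yi = np.clip(np.rint(pts[:, 1]).astype(int), 0, h - 1)
        pts = pts + flow[yi, xi]
        # A track is dropped for good once it leaves the image bounds.
        alive &= (pts[:, 0] >= 0) & (pts[:, 0] <= w - 1)
        alive &= (pts[:, 1] >= 0) & (pts[:, 1] <= h - 1)
    return pts, alive

def dense_info_nce(feat_a, feat_b, pts_a, pts_b, stride=1, temperature=0.1):
    """InfoNCE-style loss over local features at corresponding points.

    feat_a, feat_b: (C, Hf, Wf) feature maps of the two frames.
    pts_a, pts_b:   (N, 2) corresponding (x, y) pixel coordinates.
    stride:         pixels per feature-map cell (hypothetical parameter).
    """
    def sample(feat, pts):
        _, hf, wf = feat.shape
        xi = np.clip((pts[:, 0] // stride).astype(int), 0, wf - 1)
        yi = np.clip((pts[:, 1] // stride).astype(int), 0, hf - 1)
        f = feat[:, yi, xi].T  # (N, C) local features
        return f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-8)

    za, zb = sample(feat_a, pts_a), sample(feat_b, pts_b)
    logits = za @ zb.T / temperature             # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal: point i in frame A matches point i in frame B.
    return -np.mean(np.diag(log_prob))
```

In this sketch, the tracked points that survive (`alive`) would supply `pts_a` and `pts_b` for the loss, so features at the same physical point in two frames are pulled together while features at other tracked points act as negatives.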