Pixel-level Correspondence for Self-Supervised Learning from Video

Yash Sharma,Yi Zhu,Chris Russell,Thomas Brox

DOI: https://doi.org/10.48550/arXiv.2207.03866

2022-07-08

Abstract: While self-supervised learning has enabled effective representation learning in the absence of labels, for vision, video remains a relatively untapped source of supervision. To address this, we propose Pixel-level Correspondence (PiCo), a method for dense contrastive learning from video. By tracking points with optical flow, we obtain a correspondence map which can be used to match local features at different points in time. We validate PiCo on standard benchmarks, outperforming self-supervised baselines on multiple dense prediction tasks, without compromising performance on image classification.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses how to use pixel-level correspondence in video for self-supervised learning, in order to obtain more effective dense representations. The authors propose PiCo (Pixel-level Correspondence), a dense contrastive learning method that tracks points through a video and matches the local features found at those points at different times. The aim is to overcome the limitations of methods trained on static images and to exploit the spatio-temporal variation inherent in video data. The main contributions of the paper are:

1. **Introducing pixel-level correspondence**: An off-the-shelf optical flow estimator tracks points through the video, producing a correspondence map that is used to match local features at different points in time.
2. **Improving the performance of downstream tasks**: PiCo is validated on multiple benchmarks and tasks. In particular, it outperforms existing self-supervised baselines on dense prediction tasks (such as semantic segmentation and object detection).
3. **Maximizing time separation and trajectory density**: An anchor sampling strategy selects frames that maximize both the temporal separation between frames and the density of the tracked trajectories, further improving the learning signal.

With these components, PiCo achieves better performance on dense prediction tasks without compromising performance on global prediction tasks such as image classification.
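The correspondence idea described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names (`correspondence_from_flow`, `dense_info_nce`), the single-step flow field, the nearest-pixel rounding, the InfoNCE-style loss, and the temperature value are all assumptions made for illustration. The summary above does not specify PiCo's exact loss, architecture, or tracking procedure.

```python
import numpy as np


def correspondence_from_flow(flow):
    """Turn a flow field into a pixel correspondence map (hypothetical helper).

    flow: (H, W, 2) array holding (dx, dy) for each pixel of frame A,
          pointing to its location in frame B.
    Returns target x indices, target y indices, and a mask of pixels
    whose target stays inside the frame.
    """
    h, w, _ = flow.shape
    ys, xs = np.mgrid[0:h, 0:w]
    tx = np.rint(xs + flow[..., 0]).astype(int)
    ty = np.rint(ys + flow[..., 1]).astype(int)
    valid = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)
    return tx, ty, valid


def dense_info_nce(feat_a, feat_b, tx, ty, valid, temperature=0.1):
    """InfoNCE-style dense contrastive loss (an assumed loss, for illustration).

    feat_a, feat_b: (H, W, C) local feature maps of frames A and B.
    For each valid pixel in A, the positive is the feature in B at the
    flow-matched location; every other location in B acts as a negative.
    """
    h, w, c = feat_a.shape
    eps = 1e-8
    a = feat_a[valid]                      # (N, C) features of tracked pixels
    b = feat_b.reshape(-1, c)              # (H*W, C) all candidate locations
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    logits = a @ b.T / temperature         # (N, H*W) cosine similarities
    pos_index = ty[valid] * w + tx[valid]  # flat index of each positive in B
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(pos_index)), pos_index].mean()


# Toy usage: frame B's features are frame A's shifted one pixel to the right,
# so the true flow is (dx, dy) = (1, 0) everywhere.
rng = np.random.default_rng(0)
feat_a = rng.normal(size=(8, 8, 16))
feat_b = np.roll(feat_a, shift=1, axis=1)

true_flow = np.zeros((8, 8, 2))
true_flow[..., 0] = 1.0
wrong_flow = np.zeros((8, 8, 2))

loss_true = dense_info_nce(feat_a, feat_b, *correspondence_from_flow(true_flow))
loss_wrong = dense_info_nce(feat_a, feat_b, *correspondence_from_flow(wrong_flow))
print(loss_true < loss_wrong)
```

In this toy setup the loss is lower when the correspondence map comes from the correct flow than when it comes from an all-zero flow, which is the signal a dense contrastive objective relies on. A real pipeline would chain flow across many frames and handle occlusions, neither of which is modeled here.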

Point Contrastive Prediction with Semantic Clustering for Self-Supervised Learning on Point Cloud Videos

Learning multi-view visual correspondences with self-supervision

Learning Fine-Grained Features for Pixel-wise Video Correspondences

Spatial-then-Temporal Self-Supervised Learning for Video Correspondence

Contrastive Transformation for Self-supervised Correspondence Learning

Rethinking Self-supervised Correspondence Learning: A Video Frame-level Similarity Perspective

Dense Contrastive Learning for Self-Supervised Visual Pre-Training

Online Object Representations with Contrastive Learning

Self-Supervised Video Representation Learning with Motion-Contrastive Perception

Locality-Aware Inter-and Intra-Video Reconstruction for Self-Supervised Correspondence Learning

Semantic-Aware Fine-Grained Correspondence

CrossVideo: Self-supervised Cross-modal Contrastive Learning for Point Cloud Video Understanding

Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning

Learning By Analogy: Reliable Supervision From Transformations For Unsupervised Optical Flow Estimation

Motion Sensitive Contrastive Learning for Self-supervised Video Representation

Joint-task Self-supervised Learning for Temporal Correspondence

Self-Supervised Visual Representations Learning by Contrastive Mask Prediction

P4Contrast: Contrastive Learning with Pairs of Point-Pixel Pairs for RGB-D Scene Understanding

Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework

Contrastive Learning of Image Representations with Cross-Video Cycle-Consistency