Abstract: We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning as dictionary look-up, we build a dynamic dictionary with a queue and a moving-averaged encoder. This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning. MoCo provides competitive results under the common linear protocol on ImageNet classification. More importantly, the representations learned by MoCo transfer well to downstream tasks. MoCo can outperform its supervised pre-training counterpart in 7 detection/segmentation tasks on PASCAL VOC, COCO, and other datasets, sometimes surpassing it by large margins. This suggests that the gap between unsupervised and supervised representation learning has been largely closed in many vision tasks.

What problem does this paper attempt to address?

The paper addresses unsupervised representation learning in computer vision: how to use a contrastive loss to learn high-quality visual features without labels. It proposes Momentum Contrast (MoCo), a framework for unsupervised visual representation learning. The core contributions of MoCo are:

1. **Construction of a dynamic dictionary**: MoCo maintains a queue of encoded samples as the dictionary. Each training step enqueues the keys from the current mini-batch and dequeues the oldest ones, so the dictionary is refreshed continuously during training and can be much larger than a single mini-batch.
2. **Consistency of dictionary keys**: To keep the keys (i.e., dictionary entries) consistent during training, MoCo updates the key encoder as a momentum-based moving average of the query encoder. The key encoder's parameters therefore change smoothly, so keys encoded at different steps remain comparable, which improves the quality of the learned representations.
3. **Application of a contrastive loss function**: MoCo treats contrastive learning as dictionary look-up. An encoded query should match its positive key and differ from the other keys in the dictionary, so the model learns to distinguish positive from negative examples in an unsupervised manner.

The goal of MoCo is to make unsupervised representation learning more effective by building a large and consistent dynamic dictionary, narrowing the gap between unsupervised and supervised learning. The paper evaluates MoCo on multiple downstream tasks, including image classification and object detection, showing that it learns useful visual features and, in some tasks, even surpasses supervised pre-training on ImageNet.
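The three mechanisms described above (a FIFO queue of keys, a momentum-updated key encoder, and a contrastive loss over one positive and many negatives) can be sketched in a few lines. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, encoder parameters are represented as plain lists of arrays, the default values `m=0.999` and `temperature=0.07` are illustrative choices, and no gradients are computed.

```python
import numpy as np

def momentum_update(query_params, key_params, m=0.999):
    """Moving-average update of the key encoder: k <- m * k + (1 - m) * q.

    Parameters are plain lists of arrays here (an assumption for illustration).
    """
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]

def update_queue(queue, new_keys):
    """FIFO dictionary: enqueue the newest keys, dequeue the oldest.

    queue: (K, C) array, new_keys: (N, C) array. The size K stays fixed.
    """
    size = queue.shape[0]
    return np.concatenate([queue, new_keys], axis=0)[-size:]

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def contrastive_loss(q, k_pos, queue, temperature=0.07):
    """Softmax cross-entropy where the positive key is class 0.

    q, k_pos: (N, C) query and positive-key features.
    queue:    (K, C) negative keys taken from the dictionary.
    """
    q, k_pos, queue = l2_normalize(q), l2_normalize(k_pos), l2_normalize(queue)
    l_pos = np.sum(q * k_pos, axis=1, keepdims=True)       # (N, 1)
    l_neg = q @ queue.T                                     # (N, K)
    logits = np.concatenate([l_pos, l_neg], axis=1) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[:, 0].mean())
```

In a real training loop the queries would come from a trainable encoder, the keys from the momentum-updated encoder, and only the query encoder would receive gradients from the loss; the sketch leaves those parts out.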

Momentum Contrast for Unsupervised Visual Representation Learning

UniMoCo: Unsupervised, Semi-Supervised and Full-Supervised Visual Representation Learning

Momentum Contrastive Pruning

CO2: Consistent Contrast for Unsupervised Visual Representation Learning

Improved Baselines with Momentum Contrastive Learning

Unsupervised Visual Representation Learning by Synchronous Momentum Grouping

Kalman contrastive unsupervised representation learning

Self-Supervised Visual Representations Learning by Contrastive Mask Prediction

NeuroMoCo: A Neuromorphic Momentum Contrast Learning Method for Spiking Neural Networks

MoPro: Webly Supervised Learning with Momentum Prototypes

Multimodal Contrastive Training for Visual Representation Learning

Improved contrastive learning with MoCo framework

SynCo: Synthetic Hard Negatives in Contrastive Learning for Better Unsupervised Visual Representations

MOCOLNet: A Momentum Contrastive Learning Network for Multimodal Aspect-Level Sentiment Analysis

Improving Code Search with Multi-Modal Momentum Contrastive Learning

MoCo-CXR: MoCo Pretraining Improves Representation and Transferability of Chest X-ray Models

Multi-Grained Contrast for Data-Efficient Unsupervised Representation Learning

CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval

Seed the Views: Hierarchical Semantic Alignment for Contrastive Representation Learning

TS-MoCo: Time-Series Momentum Contrast for Self-Supervised Physiological Representation Learning

Modulated Contrast for Versatile Image Synthesis