Momentum Contrast for Unsupervised Visual Representation Learning

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, Ross Girshick
2020-03-24
Abstract: We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning as dictionary look-up, we build a dynamic dictionary with a queue and a moving-averaged encoder. This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning. MoCo provides competitive results under the common linear protocol on ImageNet classification. More importantly, the representations learned by MoCo transfer well to downstream tasks. MoCo can outperform its supervised pre-training counterpart in 7 detection/segmentation tasks on PASCAL VOC, COCO, and other datasets, sometimes surpassing it by large margins. This suggests that the gap between unsupervised and supervised representation learning has been largely closed in many vision tasks.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses unsupervised visual representation learning: how to train an encoder with a contrastive loss, without labels, so that its features are useful for downstream vision tasks. It proposes Momentum Contrast (MoCo), which treats contrastive learning as dictionary look-up: an encoded query should match its positive key and differ from every other key in the dictionary. The core contributions of MoCo are:

1. **A dynamic dictionary built as a queue**: The dictionary is a queue of encoded samples. Each training step enqueues the keys of the current mini-batch and dequeues the oldest ones, so the dictionary can be much larger than a single mini-batch and is refreshed continuously during training.
2. **Consistent dictionary keys**: Keys in the queue were encoded at different training steps, so they are only comparable if the key encoder changes slowly. MoCo therefore updates the key encoder as a momentum-based moving average of the query encoder, which keeps its parameters evolving smoothly.
3. **Contrastive training against the dictionary**: A contrastive loss trains the query encoder to match each query to its positive key and to distinguish it from the negative keys held in the queue, all without labels.

By building a large and consistent dictionary, MoCo aims to narrow the gap between unsupervised and supervised representation learning. The paper evaluates MoCo on ImageNet classification under the common linear protocol and on downstream detection and segmentation tasks. It reports competitive linear-classification results and shows that MoCo can outperform its supervised pre-training counterpart on 7 detection/segmentation tasks on PASCAL VOC, COCO, and other datasets.
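The three mechanisms summarized above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`momentum_update`, `update_queue`, `info_nce_loss`) are hypothetical, the encoders are represented only by lists of parameter arrays, and the momentum and temperature values are illustrative defaults rather than settings taken from this summary.

```python
import numpy as np


def momentum_update(key_params, query_params, m=0.999):
    """Moving-average update of the key encoder: k <- m * k + (1 - m) * q.

    Both arguments are lists of arrays with matching shapes (hypothetical
    stand-ins for encoder weights). m is an illustrative default.
    """
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def update_queue(queue, new_keys):
    """Enqueue the newest keys and dequeue the oldest, keeping the size fixed.

    queue: (K, D) array, oldest rows first. new_keys: (N, D) array, N <= K.
    """
    return np.concatenate([queue, new_keys], axis=0)[new_keys.shape[0]:]


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def info_nce_loss(q, k_pos, queue, temperature=0.07):
    """Contrastive (InfoNCE-style) loss with one positive key per query.

    q, k_pos: (N, D) queries and their positive keys.
    queue:    (K, D) keys used as negatives.
    temperature is an illustrative default.
    """
    q, k_pos, queue = l2_normalize(q), l2_normalize(k_pos), l2_normalize(queue)
    l_pos = np.sum(q * k_pos, axis=1, keepdims=True)   # (N, 1) positive logits
    l_neg = q @ queue.T                                # (N, K) negative logits
    logits = np.concatenate([l_pos, l_neg], axis=1) / temperature
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    # Cross-entropy with the positive key always at index 0.
    log_prob = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob.mean())
```

In a training loop, one step would compute the loss for the current queries, update the query encoder by gradient descent, apply `momentum_update` to the key encoder, and then call `update_queue` with the keys of the current mini-batch. The gradient step itself is omitted here because it depends on the framework used.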