Abstract: Contrastive learning typically matches pairs of related views among a number of unrelated negative views. Views can be generated (e.g. by augmentations) or be observed. We investigate matching when there are more than two related views which we call poly-view tasks, and derive new representation learning objectives using information maximization and sufficient statistics. We show that with unlimited computation, one should maximize the number of related views, and with a fixed compute budget, it is beneficial to decrease the number of unique samples whilst increasing the number of views of those samples. In particular, poly-view contrastive models trained for 128 epochs with batch size 256 outperform SimCLR trained for 1024 epochs at batch size 4096 on ImageNet1k, challenging the belief that contrastive models require large batch sizes and many training epochs.
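
The fixed-budget trade-off in the abstract can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that the budget is the total number of encoded views per batch (unique samples times views per sample); the paper's own compute accounting may differ. The function name `view_budget_splits` and the budget of 512 views are hypothetical.

```python
def view_budget_splits(total_views, min_views=2):
    """List (unique_samples, views_per_sample) pairs whose product is total_views.

    Assumption: the compute budget is the number of encoded views per batch.
    """
    splits = []
    for views_per_sample in range(min_views, total_views + 1):
        if total_views % views_per_sample == 0:
            splits.append((total_views // views_per_sample, views_per_sample))
    return splits


# Hypothetical budget: 256 samples x 2 views = 512 encoded views per batch.
for unique_samples, views_per_sample in view_budget_splits(512):
    print(f"{unique_samples:4d} unique samples x {views_per_sample:3d} views")
```

Each row costs the same number of encoder forward passes under this assumption; the abstract's claim is that rows with fewer unique samples and more views per sample can be the better choice.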

What problem does this paper attempt to address?

This paper addresses poly-view contrastive learning, which extends the standard contrastive learning framework to use more than two related views of the same data instance. Traditional contrastive learning matches pairs of views; this paper asks how to design representation learning tasks when many related views are available. Using information maximization and sufficient statistics, the authors derive new representation learning objectives that go beyond pairwise matching and consider all views jointly.

The main contributions of the paper are as follows:

1. Generalizing the information-theoretic foundation of contrastive learning to poly-view tasks, which leads to a new family of representation learning algorithms.
2. Providing an alternative perspective on poly-view contrastive learning based on sufficient statistics, and introducing a new loss function. When the number of views is 2, this loss function reduces to the well-known SimCLR loss, which gives a new interpretation of contrastive learning.
3. Showing experimentally that, in image representation learning, higher view multiplicity creates a new compute Pareto frontier: under a limited compute budget, it is beneficial to reduce the number of unique samples while increasing the number of views per sample. Specifically, a poly-view contrastive model trained for 128 epochs with a batch size of 256 outperforms SimCLR trained for 1024 epochs with a batch size of 4096.

The paper also investigates the effect of the number of views. According to this summary, increasing the number of views can improve the gradient signal-to-noise ratio and model performance, but does not directly increase the lower bound on mutual information. The authors analyze and design the new learning objectives using information gain, conditional independence across views, and information-theoretic lower bounds.
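
To make the idea of "more than two views" concrete, here is a minimal sketch of the simplest baseline: a two-view InfoNCE loss averaged over every ordered pair of views. This is not the paper's new poly-view objective, which the summary does not spell out; it is only the pairwise approach that such objectives go beyond. It also simplifies SimCLR by using only cross-view negatives. All function names and the toy embeddings are hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(anchors, targets, temperature=0.1):
    """Mean cross-entropy of matching anchors[i] to targets[i] among all targets."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [sum(a * t for a, t in zip(anchor, target)) / temperature
                  for target in targets]
        # Log-sum-exp with max subtraction for numerical stability.
        max_logit = max(logits)
        log_norm = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_norm - logits[i]
    return total / len(anchors)


def pairwise_multi_view_loss(views, temperature=0.1):
    """Average InfoNCE over all ordered pairs of distinct views.

    views[m][k] is the embedding of view m of sample k.
    """
    views = [[normalize(z) for z in view] for view in views]
    losses = [info_nce(views[a], views[b], temperature)
              for a in range(len(views))
              for b in range(len(views))
              if a != b]
    return sum(losses) / len(losses)


# Toy example: 3 views of 2 samples, with 2-dimensional embeddings.
views = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.8, 0.2], [0.2, 0.8]],
]
print(pairwise_multi_view_loss(views))
```

With exactly two views, `pairwise_multi_view_loss` is the symmetric two-view loss. Adding views here only adds more pairwise terms, whereas the objectives described above are said to consider all views of a sample together.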

Poly-View Contrastive Learning

What Makes for Good Views for Contrastive Learning?

Enhancing Contrastive Learning with Efficient Combinatorial Positive Pairing

Robust Contrastive Learning against Noisy Views

Contrastive Multiview Coding

Adaptive Multi-head Contrastive Learning

Rethinking Positive Pairs in Contrastive Learning

A Unified Framework for Contrastive Learning from a Perspective of Affinity Matrix

On the Importance of Contrastive Loss in Multimodal Learning

P4Contrast: Contrastive Learning with Pairs of Point-Pixel Pairs for RGB-D Scene Understanding

Crafting Better Contrastive Views for Siamese Representation Learning

$\mathbb{X}$-Sample Contrastive Loss: Improving Contrastive Learning with Sample Similarity Graphs

Contrastive Learning of Visual-Semantic Embeddings

Contrastive Learning for Non-Local Graphs with Multi-Resolution Structural Views

Contrastive Quant: Quantization Makes Stronger Contrastive Learning

Video Contrastive Learning with Global Context

Contrastive Learning Via Equivariant Representation

Dual Contrastive Prediction for Incomplete Multi-View Representation Learning

A multi-view contrastive learning for heterogeneous network embedding