Abstract: Contrastive learning is a paradigm for learning representations from unlabelled data that has been highly successful for image and text data. Several recent works have examined contrastive losses to claim that contrastive models effectively learn spectral embeddings, while few works show relations between (wide) contrastive models and kernel principal component analysis (PCA). However, it is not known if trained contrastive models indeed correspond to kernel methods or PCA. In this work, we analyze the training dynamics of two-layer contrastive models, with non-linear activation, and answer when these models are close to PCA or kernel methods. It is well known in the supervised setting that neural networks are equivalent to neural tangent kernel (NTK) machines, and that the NTK of infinitely wide networks remains constant during training. We provide the first convergence results of NTK for contrastive losses, and present a nuanced picture: NTK of wide networks remains almost constant for cosine similarity based contrastive losses, but not for losses based on dot product similarity. We further study the training dynamics of contrastive models with orthogonality constraints on output layer, which is implicitly assumed in works relating contrastive learning to spectral embedding. Our deviation bounds suggest that representations learned by contrastive models are close to the principal components of a certain matrix computed from random features. We empirically show that our theoretical results possibly hold beyond two-layer networks.

What problem does this paper attempt to address?

This paper studies contrastive learning, a self-supervised learning (SSL) approach to learning representations from unlabeled data so that semantically similar data points lie close together in the latent representation space. It analyzes two-layer nonlinear neural networks trained with contrastive losses and asks when these models are close to PCA or kernel methods.

The authors find that for contrastive losses based on cosine similarity, the neural tangent kernel (NTK) of wide networks remains almost constant during training, which suggests that such models can be approximated by methods with a fixed, deterministic kernel. For losses based on dot product similarity, by contrast, the NTK can change significantly within a short training time. The paper also studies the training dynamics of models with an orthogonality constraint on the output layer, and its deviation bounds suggest that the learned representations are close to the principal components of a certain matrix computed from random features.

The main contributions of the paper include:
1. An analysis of the NTK of two-layer networks under contrastive and non-contrastive losses, with bounds on how far the NTK deviates from its initialization after gradient descent steps.
2. A study of the training dynamics of contrastive models under an orthogonality constraint on the output layer, connecting some losses to the principal component analysis of a specific matrix.
3. Empirical evidence suggesting that the theoretical results may hold beyond two-layer networks.

In summary, the paper addresses when wide contrastive learning models can be approximated by neural tangent kernels and principal component analysis. It analyzes how these models behave during training and clarifies their relationship to PCA and kernel methods.
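To make the notion of "NTK deviation from initialization" concrete, the following minimal NumPy sketch computes the empirical NTK of a two-layer network and measures how much it changes when the weights move. It is an illustration only, not the paper's experiment: the scalar output, the tanh activation, the 1/sqrt(m) output scaling, the random unit-norm weight perturbation (used here in place of actual gradient steps on a contrastive loss), and all function names are assumptions made for this example.

```python
import numpy as np

def init_params(d, m, rng):
    """Gaussian initialization of a two-layer network with m hidden units (assumed setup)."""
    W1 = rng.standard_normal((m, d))  # hidden-layer weights
    w2 = rng.standard_normal(m)       # output-layer weights (scalar output for simplicity)
    return W1, w2

def jacobian(X, W1, w2):
    """Gradient of f(x) = w2 . tanh(W1 x) / sqrt(m) w.r.t. all weights, one row per input."""
    m = W1.shape[0]
    act = np.tanh(X @ W1.T)                      # (n, m) hidden activations
    d_w2 = act / np.sqrt(m)                      # df/dw2[j] = tanh(w1_j . x) / sqrt(m)
    coef = w2 * (1.0 - act**2) / np.sqrt(m)      # chain rule through tanh, shape (n, m)
    d_W1 = coef[:, :, None] * X[:, None, :]      # df/dW1[j, :] = coef[j] * x, shape (n, m, d)
    return np.concatenate([d_W1.reshape(len(X), -1), d_w2], axis=1)

def empirical_ntk(X, W1, w2):
    """Empirical NTK: Gram matrix of the per-example parameter gradients."""
    J = jacobian(X, W1, w2)
    return J @ J.T

def relative_ntk_change(X, params_a, params_b):
    """Frobenius-norm change of the NTK between two weight settings, relative to the first."""
    K_a = empirical_ntk(X, *params_a)
    K_b = empirical_ntk(X, *params_b)
    return np.linalg.norm(K_b - K_a) / np.linalg.norm(K_a)

def perturb(params, rng, radius=1.0):
    """Move all weights by a random direction of total norm `radius` (stand-in for training)."""
    W1, w2 = params
    dW1 = rng.standard_normal(W1.shape)
    dw2 = rng.standard_normal(w2.shape)
    scale = radius / np.sqrt(np.sum(dW1**2) + np.sum(dw2**2))
    return W1 + scale * dW1, w2 + scale * dw2

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # unit-norm inputs

changes = {}
for m in (64, 4096):
    params = init_params(5, m, rng)
    changes[m] = relative_ntk_change(X, params, perturb(params, rng))
    print(f"width {m:5d}: relative NTK change = {changes[m]:.2e}")
```

In this toy setting the same total weight movement changes the kernel much less for the wider network, which is the intuition behind a kernel that "remains almost constant". Whether actual training keeps the weights this close to initialization depends on the loss, which is the distinction the paper draws between cosine and dot product similarity.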

When can we Approximate Wide Contrastive Models with Neural Tangent Kernels and Principal Component Analysis?

When and why PINNs fail to train: A neural tangent kernel perspective

Spectra of the Conjugate Kernel and Neural Tangent Kernel for linear-width neural networks

On the Empirical Neural Tangent Kernel of Standard Finite-Width Convolutional Neural Network Architectures

Equivariant Neural Tangent Kernels

Contrastive Learning Is Spectral Clustering On Similarity Graph

Bridging Mini-Batch and Asymptotic Analysis in Contrastive Learning: From InfoNCE to Kernel-Based Losses

A Fine-Grained Spectral Perspective on Neural Networks

On the Importance of Contrastive Loss in Multimodal Learning

Contrastive estimation reveals topic posterior information to linear models

An Exact Kernel Equivalence for Finite Classification Models

Neural Tangent Kernels Motivate Graph Neural Networks with Cross-Covariance Graphs

On the Disconnect Between Theory and Practice of Neural Networks: Limits of the NTK Perspective

Evolution of Neural Tangent Kernels under Benign and Adversarial Training

A Generalized Neural Tangent Kernel Analysis for Two-layer Neural Networks

When Do Neural Networks Outperform Kernel Methods?

Exact Convergence Rates of the Neural Tangent Kernel in the Large Depth Limit

Novel Kernel Models and Exact Representor Theory for Neural Networks Beyond the Over-Parameterized Regime

Spectrum Dependent Learning Curves in Kernel Regression and Wide Neural Networks

Training Dynamics of Nonlinear Contrastive Learning Model in the High Dimensional Limit

Towards Understanding the Mechanism of Contrastive Learning via Similarity Structure: A Theoretical Analysis