When can we Approximate Wide Contrastive Models with Neural Tangent Kernels and Principal Component Analysis?

Gautham Govind Anil, Pascal Esser, Debarghya Ghoshdastidar
2024-03-14
Abstract: Contrastive learning is a paradigm for learning representations from unlabelled data that has been highly successful for image and text data. Several recent works have examined contrastive losses to claim that contrastive models effectively learn spectral embeddings, while few works show relations between (wide) contrastive models and kernel principal component analysis (PCA). However, it is not known if trained contrastive models indeed correspond to kernel methods or PCA. In this work, we analyze the training dynamics of two-layer contrastive models, with non-linear activation, and answer when these models are close to PCA or kernel methods. It is well known in the supervised setting that neural networks are equivalent to neural tangent kernel (NTK) machines, and that the NTK of infinitely wide networks remains constant during training. We provide the first convergence results of NTK for contrastive losses, and present a nuanced picture: NTK of wide networks remains almost constant for cosine similarity based contrastive losses, but not for losses based on dot product similarity. We further study the training dynamics of contrastive models with orthogonality constraints on output layer, which is implicitly assumed in works relating contrastive learning to spectral embedding. Our deviation bounds suggest that representations learned by contrastive models are close to the principal components of a certain matrix computed from random features. We empirically show that our theoretical results possibly hold beyond two-layer networks.
Machine Learning
What problem does this paper attempt to address?
This paper studies contrastive learning, a self-supervised learning (SSL) approach to learning representations from unlabeled data so that semantically similar data points lie close together in the latent representation space. It analyzes the training dynamics of two-layer nonlinear neural networks trained with contrastive losses and asks when these models are close to PCA or kernel methods.

For contrastive losses based on cosine similarity, the authors show that the neural tangent kernel (NTK) of wide networks remains almost constant during training, which suggests that such models can be approximated by a kernel method with a fixed kernel. For losses based on dot product similarity, the NTK does not remain constant. The paper also studies the training dynamics of models with orthogonality constraints on the output layer, and its deviation bounds suggest that the learned representations are close to the principal components of a certain matrix computed from random features.

The main contributions of the paper are:
1. An analysis of the NTK of two-layer networks under contrastive and non-contrastive losses, providing results on how far the NTK deviates from its initialization after gradient descent steps.
2. An investigation of the training dynamics of contrastive models under orthogonality constraints, revealing connections between some losses and the principal component analysis of a certain matrix.
3. Empirical evidence suggesting that these theoretical results may hold beyond two-layer networks.

In summary, the paper addresses when wide contrastive models can be approximated by neural tangent kernels and principal component analysis. It analyzes the behavior of these models during training and characterizes their relationships with PCA and kernel methods.
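To make the notion of "the NTK deviating from its initialization" concrete, the following is a minimal NumPy sketch, not the authors' code. It assumes a scalar-output two-layer network f(x) = v · tanh(W x) / sqrt(width) with Gaussian initialization, whereas the paper works with vector-valued representations and contrastive losses. The function names `init_params`, `empirical_ntk`, and `ntk_relative_change` are hypothetical, chosen only for illustration.

```python
import numpy as np

def init_params(d, width, rng):
    # Assumption: standard Gaussian initialization for both layers.
    W = rng.standard_normal((width, d))   # hidden-layer weights
    v = rng.standard_normal(width)        # output-layer weights
    return W, v

def empirical_ntk(X, W, v):
    """Empirical NTK of f(x) = v . tanh(W x) / sqrt(width) on the rows of X.

    K[i, j] is the inner product of the parameter gradients of f at X[i] and X[j].
    """
    width = W.shape[0]
    act = np.tanh(X @ W.T)                # (n, width) hidden activations
    dact = 1.0 - act ** 2                 # derivative of tanh at the pre-activations
    # Contribution of the gradients with respect to v.
    k_v = act @ act.T / width
    # Contribution of the gradients with respect to W:
    # df/dW[j] = v[j] * tanh'(W[j] . x) * x / sqrt(width).
    scaled = dact * v                     # (n, width)
    k_w = (scaled @ scaled.T) * (X @ X.T) / width
    return k_v + k_w

def ntk_relative_change(K_t, K_0):
    # Relative Frobenius-norm deviation of the kernel from its initial value.
    return np.linalg.norm(K_t - K_0) / np.linalg.norm(K_0)

# Usage: compare the kernel before and after a parameter change.
# The random perturbation below is only a stand-in for a training step;
# it is NOT a gradient update on any contrastive loss.
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4))
W0, v0 = init_params(d=4, width=512, rng=rng)
K0 = empirical_ntk(X, W0, v0)
W1 = W0 + 0.01 * rng.standard_normal(W0.shape)
K1 = empirical_ntk(X, W1, v0)
print("relative NTK change:", ntk_relative_change(K1, K0))
```

In an actual experiment one would replace the perturbation with gradient descent steps on a chosen loss and track `ntk_relative_change` over training for increasing widths. A value that stays small would be consistent with the "almost constant NTK" regime described above.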