Optimal Sample Complexity of Contrastive Learning

Noga Alon, Dmitrii Avdiukhin, Dor Elboim, Orr Fischer, Grigory Yaroslavtsev
2023-12-01
Abstract: Contrastive learning is a highly successful technique for learning representations of data from labeled tuples, specifying the distance relations within the tuple. We study the sample complexity of contrastive learning, i.e. the minimum number of labeled tuples sufficient for getting high generalization accuracy. We give tight bounds on the sample complexity in a variety of settings, focusing on arbitrary distance functions, both general $\ell_p$-distances, and tree metrics. Our main result is an (almost) optimal bound on the sample complexity of learning $\ell_p$-distances for integer $p$. For any $p \ge 1$ we show that $\tilde \Theta(\min(nd,n^2))$ labeled tuples are necessary and sufficient for learning $d$-dimensional representations of $n$-point datasets. Our results hold for an arbitrary distribution of the input samples and are based on giving the corresponding bounds on the Vapnik-Chervonenkis/Natarajan dimension of the associated problems. We further show that the theoretical bounds on sample complexity obtained via VC/Natarajan dimension can have strong predictive power for experimental results, in contrast with the folklore belief about a substantial gap between the statistical learning theory and the practice of deep learning.
Machine Learning
What problem does this paper attempt to address?
This paper studies the sample complexity of contrastive learning, i.e. the minimum number of labeled tuples needed to learn a representation that generalizes well, where each tuple's label specifies the distance relations among its points. The analysis covers several settings: arbitrary distance functions, general ℓp distances (of which Euclidean distance is the case p=2), and tree metrics. The main contribution is a nearly optimal bound for ℓp distances with integer p: for any p≥1, $\tilde\Theta(\min(nd, n^2))$ labeled tuples are necessary and sufficient to learn a d-dimensional representation of an n-point dataset. These results hold for an arbitrary distribution of the input samples and are obtained by bounding the Vapnik-Chervonenkis/Natarajan dimension of the associated problems. The paper points out that, despite recent attention to the theoretical foundations of contrastive learning, most work has approached the topic from other perspectives, such as loss function design and transfer learning. The paper emphasizes the importance of sample complexity in deep learning: the cost of obtaining samples remains a major consideration even when class labels are available, since training cost grows linearly with the number of samples. In addition, in certain settings sample complexity corresponds directly to annotation cost. The main results first address the case k=1 and then extend to general k, where k is the number of negative examples in each tuple. The theoretical results include upper and lower bounds on the sample complexity for arbitrary distances, Euclidean distances, cosine similarity, and tree metrics. The authors also show that these theoretical bounds align with experimental results, supporting the predictive power of classical PAC learning theory for deep learning practice.
In short, this paper asks how many labeled tuples contrastive learning needs in order to learn a good distance function, and it gives nearly tight bounds on that number for several distance functions as a function of the dataset size n and the embedding dimension d.
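To make the setting concrete, the following Python sketch shows how a single labeled triplet can be produced under an ℓp distance and how the leading term min(nd, n²) of the bound behaves. It is an illustration only, not the authors' code: the function names `lp_distance`, `label_triplet`, and `sample_complexity_scale` are hypothetical, and the scale function deliberately ignores the polylogarithmic factors hidden by the tilde as well as any dependence on accuracy and confidence parameters.

```python
def lp_distance(u, v, p):
    """l_p distance between two equal-length vectors (p >= 1)."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def label_triplet(emb, x, y, z, p=2):
    """Label a triplet of point indices (x, y, z) for the k = 1 case.

    Returns +1 if x is strictly closer to y than to z under the l_p
    distance of the embedding `emb`, and -1 otherwise.
    """
    d_xy = lp_distance(emb[x], emb[y], p)
    d_xz = lp_distance(emb[x], emb[z], p)
    return 1 if d_xy < d_xz else -1


def sample_complexity_scale(n, d):
    """Leading term min(n*d, n^2) of the bound from the abstract.

    Assumption: polylogarithmic factors and accuracy/confidence
    parameters are omitted, so this is a growth rate, not a count.
    """
    return min(n * d, n * n)


if __name__ == "__main__":
    # Three points in the plane; point 0 is closer to point 1 than to point 2.
    emb = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
    print(label_triplet(emb, 0, 1, 2))         # closer to y
    print(label_triplet(emb, 0, 2, 1))         # closer to z
    print(sample_complexity_scale(1000, 128))  # d < n: the n*d term is smaller
    print(sample_complexity_scale(100, 512))   # d > n: the n^2 term is smaller
```

The two calls to `sample_complexity_scale` illustrate the two regimes of the bound: when the embedding dimension d is smaller than the number of points n, the nd term governs, and once d exceeds n the bound saturates at n².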