Bridging Mini-Batch and Asymptotic Analysis in Contrastive Learning: From InfoNCE to Kernel-Based Losses

Panagiotis Koromilas, Giorgos Bouritsas, Theodoros Giannakopoulos, Mihalis Nicolaou, Yannis Panagakis
2024-05-28
Abstract: What do different contrastive learning (CL) losses actually optimize for? Although multiple CL methods have demonstrated remarkable representation learning capabilities, the differences in their inner workings remain largely opaque. In this work, we analyse several CL families and prove that, under certain conditions, they admit the same minimisers when optimizing either their batch-level objectives or their expectations asymptotically. In both cases, an intimate connection with the hyperspherical energy minimisation (HEM) problem resurfaces. Drawing inspiration from this, we introduce a novel CL objective, coined Decoupled Hyperspherical Energy Loss (DHEL). DHEL simplifies the problem by decoupling the target hyperspherical energy from the alignment of positive examples while preserving the same theoretical guarantees. Going one step further, we show the same results hold for another relevant CL family, namely kernel contrastive learning (KCL), with the additional advantage of the expected loss being independent of batch size, thus identifying the minimisers in the non-asymptotic regime. Empirical results demonstrate improved downstream performance and robustness across combinations of different batch sizes and hyperparameters and reduced dimensionality collapse, on several computer vision datasets.
Machine Learning, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper asks what different contrastive learning (CL) loss functions actually optimize, and whether they optimize the same thing. It focuses on four aspects:

1. **Optimization Objectives of Different Contrastive Learning Loss Functions**:
   - Although multiple contrastive learning methods have demonstrated excellent representation learning capabilities, the differences in their internal mechanisms remain opaque. By analyzing several CL families, the paper proves that, under certain conditions, they share the same minimisers whether one optimizes the batch-level objective or its expectation asymptotically. These minimisers are closely related to the Hyperspherical Energy Minimisation (HEM) problem.
2. **Introduction of a New Contrastive Learning Objective**:
   - The paper proposes a new contrastive learning objective, called Decoupled Hyperspherical Energy Loss (DHEL). DHEL simplifies the problem by decoupling the target hyperspherical energy from positive-sample alignment while retaining the same theoretical guarantees.
3. **Analysis of Kernel Contrastive Learning (KCL)**:
   - The paper shows that the same results hold for another related class of methods, kernel contrastive learning. Because the expected KCL loss is independent of the batch size, its minimisers can be identified in the non-asymptotic regime, without relying on an asymptotic argument.
4. **Experimental Verification**:
   - Experiments on several computer vision datasets report improved downstream performance and robustness across combinations of batch sizes and hyperparameters, together with reduced dimensionality collapse.

### Specific Problems and Solutions

1. **Consistency of Batch Size and Asymptotic Behavior**:
   - The paper characterizes the optimal solutions in both the single-batch case and the asymptotic-expectation case. In the finite-batch case, when the batch size does not exceed the ambient dimension plus 1, multiple InfoNCE variants share the same unique optimal solution. In the asymptotic case, these variants also exhibit the same behavior, namely perfect alignment of positive pairs and a uniform distribution of representations on the hypersphere.
2. **Impact of Decoupling Positive and Negative Samples**:
   - DHEL replaces the classical InfoNCE denominator, which includes the positive pair, with a denominator containing only negative samples. The denominator therefore no longer depends on the positive sample, so the alignment term and the uniformity term can be optimized independently, which simplifies optimization.
3. **Advantages of Kernel Contrastive Learning**:
   - KCL's expected loss is independent of the batch size, so its minimisers can be identified in the non-asymptotic case. The analysis of KCL is therefore not tied to a large-batch limit, which makes it more flexible in practice.

### Experimental Results

- **Performance and Robustness**:
  - DHEL and KCL show improved downstream performance and robustness across different batch sizes and hyperparameter combinations.
- **Reduction of Dimension Collapse**:
  - DHEL and KCL make use of more representation dimensions, reducing dimensionality collapse.

In summary, through theoretical analysis and experimental verification, this paper proposes a new contrastive learning objective, DHEL, and further explores the advantages of kernel contrastive learning (KCL), providing new ideas for the optimization of contrastive learning methods.
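
The difference between the InfoNCE denominator and a negatives-only denominator can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the temperature `tau`, the one-directional form of both losses, and the choice to draw the decoupled loss's negatives from the same view are all assumptions made for this example. The helper `regular_simplex` builds the configuration that the finite-batch result points to (batch size at most the ambient dimension plus 1), so the loss there can be compared with a random batch.

```python
import numpy as np

def l2_normalize(x):
    # Project each row onto the unit hypersphere.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def logsumexp(a, axis):
    # Numerically stable log(sum(exp(a))) along one axis.
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def infonce_loss(u, v, tau=0.5):
    # InfoNCE-style loss (assumed one-directional form): the denominator
    # contains the positive pair (diagonal) and the cross-view negatives.
    logits = u @ v.T / tau
    pos = np.diag(logits)
    return float(np.mean(logsumexp(logits, axis=1) - pos))

def dhel_style_loss(u, v, tau=0.5):
    # Decoupled variant (assumed form): the denominator contains only
    # negatives, here taken from the same view, so it does not involve v.
    m = u.shape[0]
    pos = np.sum(u * v, axis=1) / tau             # alignment term
    sim = u @ u.T / tau
    off_diag = ~np.eye(m, dtype=bool)
    neg = sim[off_diag].reshape(m, m - 1)         # drop self-similarities
    return float(np.mean(logsumexp(neg, axis=1) - pos))

def regular_simplex(m):
    # m unit vectors in R^m with pairwise inner product -1/(m-1).
    return l2_normalize(np.eye(m) - 1.0 / m)

# Usage: compare a random batch with the aligned simplex configuration.
rng = np.random.default_rng(0)
u = l2_normalize(rng.normal(size=(5, 8)))
v = l2_normalize(rng.normal(size=(5, 8)))
s = regular_simplex(5)
print("InfoNCE, random batch:   ", infonce_loss(u, v))
print("decoupled, random batch: ", dhel_style_loss(u, v))
print("decoupled, simplex batch:", dhel_style_loss(s, s))
```

Because the negatives-only denominator in this sketch depends only on `u`, changing `v` moves only the alignment term, which is the decoupling the summary describes.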