InfoNCE: Identifying the Gap Between Theory and Practice

Evgenia Rusak, Patrik Reizinger, Attila Juhos, Oliver Bringmann, Roland S. Zimmermann, Wieland Brendel
2024-06-29
Abstract: Previous theoretical work on contrastive learning (CL) with InfoNCE showed that, under certain assumptions, the learned representations uncover the ground-truth latent factors. We argue these theories overlook crucial aspects of how CL is deployed in practice. Specifically, they assume that within a positive pair, all latent factors either vary to a similar extent, or that some do not vary at all. However, in practice, positive pairs are often generated using augmentations such as strong cropping to just a few pixels. Hence, a more realistic assumption is that all latent factors change, with a continuum of variability across these factors. We introduce AnInfoNCE, a generalization of InfoNCE that can provably uncover the latent factors in this anisotropic setting, broadly generalizing previous identifiability results in CL. We validate our identifiability results in controlled experiments and show that AnInfoNCE increases the recovery of previously collapsed information in CIFAR10 and ImageNet, albeit at the cost of downstream accuracy. Additionally, we explore and discuss further mismatches between theoretical assumptions and practical implementations, including extensions to hard negative mining and loss ensembles.
Machine Learning, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses a gap between theory and practice in contrastive learning (CL). Existing identifiability theory assumes that, within a positive pair, all latent factors either vary to a similar extent or that some do not vary at all. In practice, positive pairs are usually generated with augmentations, such as strong cropping down to a few pixels, so all latent factors change, but to different extents. Existing theory does not account for this anisotropy. To address this, the authors introduce AnInfoNCE, a generalization of the InfoNCE loss that provably identifies the latent factors in this anisotropic setting and thereby broadly generalizes previous identifiability results. The authors also explore other mismatches between theoretical assumptions and practical implementations, including extensions to hard negative mining and loss ensembles.

### Main Contributions:
1. **Introduction of AnInfoNCE**: A generalized, identifiable contrastive loss that assumes an anisotropic distribution of positive pairs.
2. **Extensions to Hard Negative Mining and Loss Ensembles**: A model of hard negative mining that remains theoretically identifiable, together with an extension of the main identifiability results to loss ensembles.
3. **Experimental Validation**: The new loss is validated in synthetic and image experiments. On CIFAR10 and ImageNet it recovers more of the previously collapsed latent information, although downstream classification accuracy decreases.
4. **Discussion of Remaining Gaps Between Theory and Practice**: An analysis of how the augmentations used on real data affect the results, and of strategies to further bridge the gap between theory and practice.
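
To make the difference between an isotropic and an anisotropic similarity concrete, the following is a minimal NumPy sketch, not the authors' implementation. This summary does not state the exact form of AnInfoNCE, so the sketch assumes a squared distance weighted by a non-negative per-dimension vector `lam` (a hypothetical name), with in-batch negatives. Setting `lam` to all ones gives an ordinary squared-L2 InfoNCE-style loss.

```python
import numpy as np

def weighted_sq_dist(a, b, lam):
    """Pairwise weighted squared distance: sum_k lam[k] * (a[i,k] - b[j,k])**2."""
    diff = a[:, None, :] - b[None, :, :]            # shape (n, m, d)
    return np.einsum("nmd,d->nm", diff ** 2, lam)   # shape (n, m)

def contrastive_loss(z, z_pos, lam):
    """InfoNCE-style loss with in-batch negatives.

    z, z_pos: (n, d) encodings of anchors and their positives.
    lam:      (d,) non-negative per-dimension weights (assumed parameterization).
    """
    logits = -weighted_sq_dist(z, z_pos, lam)             # diagonal entries are positives
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Toy positive pairs in which every dimension changes, but to a different extent.
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 4))
noise_scale = np.array([0.05, 0.05, 0.5, 1.0])      # illustrative values only
z_pos = z + rng.normal(size=z.shape) * noise_scale

print(contrastive_loss(z, z_pos, np.ones(4)))                   # isotropic weights
print(contrastive_loss(z, z_pos, 1.0 / (2 * noise_scale ** 2))) # per-dimension weights
```

In a trained model the weights would be learned jointly with the encoder rather than fixed by hand as in this toy example.
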
### Experimental Results:
- **Synthetic Experiments**: On synthetic data, AnInfoNCE achieves high linear identifiability (R² scores) for both content and style latent factors across a wide range of concentration parameters, whereas the standard InfoNCE loss fails to identify the style latent factors.
- **MNIST Experiments**: On the MNIST dataset, AnInfoNCE perfectly identifies all latent factors, while the standard InfoNCE loss performs poorly at identifying the style latent factors.
- **Real-World Experiments**: On CIFAR10 and ImageNet, AnInfoNCE achieves higher augmentation readout accuracy, recovering more latent dimensions, but downstream classification accuracy does not improve and even decreases.

### Analysis:
AnInfoNCE performs well in controlled scenarios, but on real-world datasets such as CIFAR10 and ImageNet there is a trade-off between augmentation readout accuracy and linear classification readout accuracy. Higher augmentation readout accuracy indicates that style latent factors are captured better, yet this does not translate into higher classification accuracy. The effect may be related to the augmentations used on real data, and further research is needed to bridge the gap between theory and practice.
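
The linear identifiability (R²) score mentioned for the synthetic experiments can be illustrated with a small sketch. The summary does not say how the score is computed, so this assumes a common approach: fit a least-squares affine map from the learned representation to the ground-truth latents and report R² per latent dimension. The function name `linear_r2` and the toy data are hypothetical, and the score is computed in-sample for brevity, whereas a careful evaluation would use held-out data.

```python
import numpy as np

def linear_r2(z_true, z_learned):
    """Per-dimension R^2 of a least-squares affine map z_learned -> z_true."""
    X = np.hstack([z_learned, np.ones((len(z_learned), 1))])  # append bias column
    coef, *_ = np.linalg.lstsq(X, z_true, rcond=None)
    resid = z_true - X @ coef
    ss_res = (resid ** 2).sum(axis=0)
    ss_tot = ((z_true - z_true.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
z_true = rng.normal(size=(500, 4))   # toy ground truth: dims 0-1 "content", 2-3 "style"

# A representation that is an invertible affine function of all latents.
mixing = np.eye(4) + 0.1 * rng.normal(size=(4, 4))
z_full = z_true @ mixing + 0.5

# A representation in which the "style" dimensions have collapsed to a constant.
z_collapsed = z_true.copy()
z_collapsed[:, 2:] = 0.0

print(linear_r2(z_true, z_full))       # high for every dimension
print(linear_r2(z_true, z_collapsed))  # high for dims 0-1, near zero for dims 2-3
```

A low score on some dimensions is the signature of the collapsed information described above: no linear readout can recover those factors from the representation.
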