On the Comparison between Multi-modal and Single-modal Contrastive Learning

Wei Huang, Andi Han, Yongqiang Chen, Yuan Cao, Zhiqiang Xu, Taiji Suzuki
2024-11-05
Abstract: Multi-modal contrastive learning with language supervision has presented a paradigm shift in modern machine learning. By pre-training on a web-scale dataset, multi-modal contrastive learning can learn high-quality representations that exhibit impressive robustness and transferability. Despite its empirical success, the theoretical understanding is still in its infancy, especially regarding its comparison with single-modal contrastive learning. In this work, we introduce a feature learning theory framework that provides a theoretical foundation for understanding the differences between multi-modal and single-modal contrastive learning. Based on a data generation model consisting of signal and noise, our analysis is performed on a ReLU network trained with the InfoMax objective function. Through a trajectory-based optimization analysis and generalization characterization on downstream tasks, we identify the critical factor, which is the signal-to-noise ratio (SNR), that impacts the generalizability in downstream tasks of both multi-modal and single-modal contrastive learning. Through the cooperation between the two modalities, multi-modal learning can achieve better feature learning, leading to improvements in performance in downstream tasks compared to single-modal learning. Our analysis provides a unified framework that can characterize the optimization and generalization of both single-modal and multi-modal contrastive learning. Empirical experiments on both synthetic and real-world datasets further consolidate our theoretical findings.
Machine Learning
What problem does this paper attempt to address?
This paper sets out to build a unified theoretical framework for comparing multi-modal and single-modal contrastive learning, so that the differences in their optimization and generalization behavior can be understood. Specifically, the authors aim to explain theoretically why multi-modal contrastive learning usually outperforms single-modal contrastive learning on downstream tasks, especially when the signal-to-noise ratio (SNR) is low. Using a data-generation model made up of signal and noise, and ReLU networks trained by gradient descent, they analyze the training trajectories of both methods and how those trajectories affect generalization on downstream tasks. The main contributions of the paper are as follows:
1. **Systematic Optimization Analysis**: For the first time in a non-convex setting, the paper carries out a systematic comparative optimization analysis of single-modal and multi-modal contrastive learning under gradient-descent training. It shows that both can reach near-zero training error after a polynomial number of iterations, despite the non-convexity of the problem.
2. **Differences in Feature Learning and Generalization**: By tracking how ReLU networks learn signal and memorize noise along the training trajectory, the paper characterizes the difference in generalization between single-modal and multi-modal contrastive learning. Differences in SNR across modalities cause the two frameworks to diverge in downstream generalization.
3. **Advantages of Multi-modality**: The theoretical analysis attributes the advantage of multi-modal contrastive learning to the high quality of the second modality and to the cooperation between the two modalities. This cooperation lets multi-modal contrastive learning learn useful features more effectively, which yields better generalization on downstream tasks. Experimental results further support this theoretical finding.
In summary, the paper establishes the generalization advantage of multi-modal contrastive learning through theoretical analysis and experiments, and provides a unified framework for understanding and comparing the optimization and generalization of single-modal and multi-modal contrastive learning.
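The signal-plus-noise setting described in this summary can be illustrated with a small NumPy sketch. It is only an illustration under stated assumptions, not the paper's exact data model: the additive form `label * signal + noise`, the SNR definition (signal norm divided by `noise_std * sqrt(d)`), and all names and numbers (`make_modality`, `snr`, `mu`, the two noise levels) are hypothetical choices made here to show how two modalities sharing one signal can differ in SNR.

```python
import numpy as np

def make_modality(labels, signal, noise_std, rng):
    """Draw samples as label * signal + Gaussian noise (assumed, illustrative model)."""
    noise = rng.normal(0.0, noise_std, size=(labels.shape[0], signal.shape[0]))
    return labels[:, None] * signal[None, :] + noise

def snr(signal, noise_std):
    """Assumed SNR definition: signal norm over the typical noise norm sigma * sqrt(d)."""
    return np.linalg.norm(signal) / (noise_std * np.sqrt(signal.shape[0]))

rng = np.random.default_rng(0)
n, d = 64, 100                       # hypothetical sample count and dimension
labels = rng.choice([-1.0, 1.0], size=n)

mu = np.zeros(d)
mu[0] = 2.0                          # hypothetical signal direction shared by both modalities

# Hypothetical noise levels: the first modality is noisy, the second is cleaner.
x_noisy = make_modality(labels, mu, noise_std=1.0, rng=rng)
x_clean = make_modality(labels, mu, noise_std=0.1, rng=rng)

snr_noisy = snr(mu, 1.0)
snr_clean = snr(mu, 0.1)
print(f"SNR (noisy modality): {snr_noisy:.3f}")
print(f"SNR (clean modality): {snr_clean:.3f}")
```

In this toy setup the second modality has a tenfold higher SNR than the first, which mirrors the summary's point that a higher-quality second modality is one source of the multi-modal advantage.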