Abstract: Multi-modal contrastive learning with language supervision has presented a paradigm shift in modern machine learning. By pre-training on a web-scale dataset, multi-modal contrastive learning can learn high-quality representations that exhibit impressive robustness and transferability. Despite its empirical success, the theoretical understanding is still in its infancy, especially regarding its comparison with single-modal contrastive learning. In this work, we introduce a feature learning theory framework that provides a theoretical foundation for understanding the differences between multi-modal and single-modal contrastive learning. Based on a data generation model consisting of signal and noise, our analysis is performed on a ReLU network trained with the InfoMax objective function. Through a trajectory-based optimization analysis and generalization characterization on downstream tasks, we identify the critical factor, which is the signal-to-noise ratio (SNR), that impacts the generalizability in downstream tasks of both multi-modal and single-modal contrastive learning. Through the cooperation between the two modalities, multi-modal learning can achieve better feature learning, leading to improvements in performance in downstream tasks compared to single-modal learning. Our analysis provides a unified framework that can characterize the optimization and generalization of both single-modal and multi-modal contrastive learning. Empirical experiments on both synthetic and real-world datasets further consolidate our theoretical findings.

What problem does this paper attempt to address?

This paper sets out to build a systematic theoretical framework for comparing multi-modal and single-modal contrastive learning, so that the differences in their optimization and generalization behavior can be understood. Specifically, the authors aim to explain theoretically why multi-modal contrastive learning usually outperforms single-modal contrastive learning on downstream tasks, especially when the signal-to-noise ratio (SNR) is low. Using a data-generation model composed of signal and noise, and a ReLU network as the learner, they analyze the trajectories of both learning methods under gradient descent and the effect of those trajectories on downstream generalization.

The main contributions of the paper are as follows:

1. **Establishment of Systematic Optimization Analysis**: For the first time in a non-convex setting, the paper carries out a systematic comparative optimization analysis of single-modal and multi-modal contrastive learning trained by gradient descent. It shows that both can reach near-zero training error after a polynomial number of iterations, despite the non-convexity of the problem.
2. **Differences in Feature Learning and Generalization**: By tracking how the ReLU network learns signal from the data and memorizes noise, the paper characterizes the difference in generalization between single-modal and multi-modal contrastive learning. Differences in SNR across modalities cause the two frameworks to diverge in downstream generalization.
3. **Advantages of Multi-modality**: The theoretical analysis attributes the advantage of multi-modal contrastive learning to the high quality of the second modality and to the cooperation between the two modalities. This cooperation lets multi-modal contrastive learning learn useful features more effectively, which yields better generalization on downstream tasks. Experimental results further support this finding.

Overall, the paper establishes the generalization advantage of multi-modal contrastive learning through theoretical analysis and experiments, and provides a unified framework for understanding and comparing the optimization and generalization of single-modal and multi-modal contrastive learning.
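To make the signal-plus-noise setup and the role of SNR more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the two-patch data layout, the SNR formula (signal norm divided by noise standard deviation times the square root of the dimension), the InfoNCE-style loss used as a stand-in for the contrastive objective, and all function names are hypothetical choices made for this example.

```python
import numpy as np


def make_signal_noise_data(n, d, signal_norm, noise_std, rng):
    """Draw n samples, each with a signal patch y * mu and a noise patch xi.

    Assumption: the signal lies along the first coordinate and the noise is
    Gaussian, restricted to the remaining coordinates so that it is
    orthogonal to the signal direction.
    """
    mu = np.zeros(d)
    mu[0] = signal_norm                       # fixed signal vector
    y = rng.choice([-1.0, 1.0], size=n)       # latent binary label
    signal = y[:, None] * mu[None, :]         # shape (n, d)
    noise = noise_std * rng.standard_normal((n, d))
    noise[:, 0] = 0.0                         # remove the component along mu
    return signal, noise, y


def snr(signal_norm, noise_std, d):
    """Assumed SNR definition: ||mu|| / (sigma * sqrt(d))."""
    return signal_norm / (noise_std * np.sqrt(d))


def info_nce(za, zb, temperature=1.0):
    """InfoNCE-style loss: row i of za is the positive pair of row i of zb."""
    logits = za @ zb.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))


rng = np.random.default_rng(0)
n, d = 8, 16

# Hypothetical first modality (e.g. images) with a low SNR.
sig_a, noise_a, labels = make_signal_noise_data(n, d, 1.0, 1.0, rng)
# Hypothetical second modality (e.g. text) with a higher SNR.
sig_b, noise_b, _ = make_signal_noise_data(n, d, 4.0, 1.0, rng)

print("SNR of modality A:", snr(1.0, 1.0, d))
print("SNR of modality B:", snr(4.0, 1.0, d))
# Loss on raw (untrained) inputs, shown only to exercise the function.
print("loss on raw inputs:", info_nce(sig_a + noise_a, sig_b + noise_b))
```

In this sketch the second modality has the higher SNR only because its signal norm is larger at the same noise level. The paper's analysis concerns how a trained ReLU network's weights align with the signal versus the noise in such data, which this example does not attempt to reproduce.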

On the Comparison between Multi-modal and Single-modal Contrastive Learning

Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal Prediction for Multimodal Sentiment Analysis

Learning from the Global View: Supervised Contrastive Learning of Multimodal Representation

On the duality between contrastive and non-contrastive self-supervised learning

On the Generalization of Multi-modal Contrastive Learning

Understanding Dark Scenes by Contrasting Multi-Modal Observations

Learning the Unlearned: Mitigating Feature Suppression in Contrastive Learning

Improving the Modality Representation with Multi-View Contrastive Learning for Multimodal Sentiment Analysis

Turbo your multi-modal classification with contrastive learning

Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning

Hybrid Contrastive Learning of Tri-Modal Representation for Multimodal Sentiment Analysis

Multi-Trusted Cross-Modal Information Bottleneck for 3D Self-Supervised Representation Learning

A Theory of Multimodal Learning

What to align in multimodal contrastive learning?

Multi-level cross-modal contrastive learning for review-aware recommendation

Multimodal Contrastive Training for Visual Representation Learning

Explaining and Mitigating the Modality Gap in Contrastive Multimodal Learning

Contrastive Learning on Multimodal Analysis of Electronic Health Records

Contrastive Learning for Multi-Modal Automatic Code Review

On the Importance of Contrastive Loss in Multimodal Learning

Cross-modal contrastive learning for multimodal sentiment recognition