Abstract: Multi-modal contrastive learning (MMCL) has recently garnered considerable interest due to its superior performance in visual tasks, achieved by embedding multi-modal data, such as visual-language pairs. However, there is still a lack of theoretical understanding of how MMCL extracts useful visual representations from multi-modal pairs, and particularly, how MMCL outperforms previous approaches like self-supervised contrastive learning (SSCL). In this paper, by drawing an intrinsic connection between MMCL and asymmetric matrix factorization, we establish the first generalization guarantees of MMCL for visual downstream tasks. Based on this framework, we further unify MMCL and SSCL by showing that MMCL implicitly performs SSCL with (pseudo) positive pairs induced by text pairs. Through this unified perspective, we characterize the advantage of MMCL by showing that text pairs induce more semantically consistent and diverse positive pairs, which, according to our analysis, provably benefit downstream generalization. Inspired by this finding, we propose CLIP-guided resampling methods to significantly improve the downstream performance of SSCL on ImageNet by leveraging multi-modal information. Code is available at https://github.com/PKU-ML/CLIP-Help-SimCLR.

What problem does this paper attempt to address?

This paper addresses the lack of theoretical understanding of how multi-modal contrastive learning (MMCL) generalizes to visual tasks. Although MMCL performs very well on visual tasks, it has remained unclear how it extracts useful visual representations from multi-modal data pairs and why it outperforms earlier methods such as self-supervised contrastive learning (SSCL). By establishing an intrinsic connection between MMCL and asymmetric matrix factorization, the paper provides the first generalization guarantee for MMCL on visual downstream tasks. It further unifies MMCL and SSCL, showing that text pairs induce more semantically consistent and diverse positive pairs, which benefits downstream generalization.

### Main Contributions:

1. **Theoretical Guarantee**: Establishes the first generalization guarantee for multi-modal contrastive learning (MMCL) by connecting its objective function to an asymmetric matrix factorization objective.
2. **Unified Perspective**: Provides a unified view for understanding and comparing multi-modal and self-supervised contrastive learning, explaining why MMCL performs better on downstream tasks.
3. **Empirical Verification**: Proposes using the multi-modal information in pre-trained models (such as CLIP) to guide self-supervised learning (such as SimCLR), and reports a significant performance improvement on ImageNet.

### Theoretical Framework:

- **Mathematical Modeling**: Reformulates the MMCL objective as an asymmetric matrix factorization problem, revealing that MMCL essentially learns a low-rank decomposition of the joint distribution of image-text pairs.
- **Ideal Representation**: Characterizes the ideal MMCL representation through singular value decomposition (SVD) and provides a generalization guarantee for linear probing.
- **Generalization Bound**: Identifies two key factors, label error and singular values, that determine how well multi-modal pre-training generalizes to downstream tasks.

### Experimental Verification:

- **Data Generation**: Compares the positive pairs generated by CLIP and SimCLR, showing that text-induced positive pairs are more semantically consistent and more diverse.
- **Performance Improvement**: Proposes four techniques that use multi-modal information to guide self-supervised learning, significantly improving the performance of SimCLR on ImageNet.

In conclusion, through theoretical analysis and experiments, the paper provides the first generalization guarantee for multi-modal contrastive learning, explains why it outperforms self-supervised contrastive learning on downstream tasks, and proposes concrete methods for improving SSCL.
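The link between a contrastive objective and asymmetric matrix factorization can be checked numerically on a toy problem. The sketch below is illustrative only and is not taken from the paper's code. It assumes a tiny discrete set of images and texts with a randomly generated joint distribution `P`, and it assumes the contrastive objective takes a spectral form (a linear alignment term minus a squared uniformity term), which the summary above does not spell out. All variable names are hypothetical. Under those assumptions, the truncated SVD of the normalized co-occurrence matrix gives the best rank-`k` factorization, and the factorization loss equals the spectral contrastive loss up to a constant.

```python
import numpy as np

rng = np.random.default_rng(0)
n_img, n_txt, k = 6, 5, 3  # toy sizes (assumed): 6 images, 5 texts, 3-dim embeddings

# Hypothetical joint distribution P(x_v, x_l) over image-text pairs.
P = rng.random((n_img, n_txt))
P /= P.sum()
p_v = P.sum(axis=1)  # marginal over images
p_l = P.sum(axis=0)  # marginal over texts

# Normalized co-occurrence matrix: P(x_v, x_l) / sqrt(P(x_v) * P(x_l)).
P_tilde = P / np.sqrt(np.outer(p_v, p_l))

# Best rank-k asymmetric factorization P_tilde ~ F_v @ F_l.T via truncated SVD
# (Eckart-Young theorem).
U, s, Vt = np.linalg.svd(P_tilde, full_matrices=False)
F_v = U[:, :k] * np.sqrt(s[:k])
F_l = Vt[:k].T * np.sqrt(s[:k])

# Matrix factorization loss; at the optimum it equals the discarded spectrum.
amf_loss = np.linalg.norm(P_tilde - F_v @ F_l.T) ** 2
tail_energy = np.sum(s[k:] ** 2)

# Per-sample embeddings implied by the factors: f(x) = F[x] / sqrt(P(x)).
f_v = F_v / np.sqrt(p_v)[:, None]
f_l = F_l / np.sqrt(p_l)[:, None]

# Spectral contrastive loss (assumed form): pull paired samples together,
# penalize squared similarity of independently drawn samples.
sim = f_v @ f_l.T
spectral_loss = -2.0 * np.sum(P * sim) + np.sum(np.outer(p_v, p_l) * sim ** 2)

# The two losses differ only by the constant ||P_tilde||_F^2.
constant = np.linalg.norm(P_tilde) ** 2
print(np.isclose(amf_loss, tail_energy))
print(np.isclose(amf_loss, spectral_loss + constant))
```

Because the two losses differ only by a constant that does not depend on the encoders, minimizing one minimizes the other. The singular values that remain after truncation (`s[k:]`) measure how much of the joint distribution a `k`-dimensional representation cannot capture.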

On the Generalization of Multi-modal Contrastive Learning

Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval

MXM-CLR: A Unified Framework for Contrastive Learning of Multifold Cross-Modal Representations

Multiple Contrastive Learning for Multimodal Sentiment Analysis

Multimodal Contrastive In-Context Learning

Explaining and Mitigating the Modality Gap in Contrastive Multimodal Learning

Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal Prediction for Multimodal Sentiment Analysis

Multimodal Pretraining from Monolingual to Multilingual

Multimodal Multilabel Classification by CLIP

Improving Spoken Language Understanding with Cross-Modal Contrastive Learning

Understanding Dark Scenes by Contrasting Multi-Modal Observations

Cross-modal contrastive learning for multimodal sentiment recognition

AsCL: An Asymmetry-sensitive Contrastive Learning Method for Image-Text Retrieval with Cross-Modal Fusion

Learning the Unlearned: Mitigating Feature Suppression in Contrastive Learning

Deep Contrastive Representation Learning for Multi-Modal Clustering

Connecting Multi-modal Contrastive Representations

Multi-Trusted Cross-Modal Information Bottleneck for 3D Self-Supervised Representation Learning

Contrastive Learning Based Visual Representation Enhancement for Multimodal Machine Translation

CLMLF: A Contrastive Learning and Multi-Layer Fusion Method for Multimodal Sentiment Detection

Linking Representations with Multimodal Contrastive Learning
