On the Generalization of Multi-modal Contrastive Learning

Qi Zhang, Yifei Wang, Yisen Wang
2023-06-07
Abstract: Multi-modal contrastive learning (MMCL) has recently garnered considerable interest due to its superior performance in visual tasks, achieved by embedding multi-modal data, such as visual-language pairs. However, there is still a lack of theoretical understanding of how MMCL extracts useful visual representations from multi-modal pairs, and particularly, how MMCL outperforms previous approaches like self-supervised contrastive learning (SSCL). In this paper, by drawing an intrinsic connection between MMCL and asymmetric matrix factorization, we establish the first generalization guarantees of MMCL for visual downstream tasks. Based on this framework, we further unify MMCL and SSCL by showing that MMCL implicitly performs SSCL with (pseudo) positive pairs induced by text pairs. Through this unified perspective, we characterize the advantage of MMCL by showing that text pairs induce more semantically consistent and diverse positive pairs, which, according to our analysis, provably benefit downstream generalization. Inspired by this finding, we propose CLIP-guided resampling methods to significantly improve the downstream performance of SSCL on ImageNet by leveraging multi-modal information. Code is available at https://github.com/PKU-ML/CLIP-Help-SimCLR.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the lack of a theoretical account of how multi-modal contrastive learning (MMCL) generalizes to visual tasks. Although MMCL performs very well in practice, it has been unclear how it extracts useful visual representations from multi-modal data pairs and why it outperforms earlier methods such as self-supervised contrastive learning (SSCL). By connecting MMCL to asymmetric matrix factorization, the paper gives the first generalization guarantee for MMCL on visual downstream tasks. It then unifies MMCL and SSCL, showing that text pairs induce more semantically consistent and diverse positive pairs, which benefits downstream generalization.

### Main Contributions:

1. **Theoretical Guarantee**: Establishes the first generalization guarantee for MMCL by connecting its objective function to an asymmetric matrix factorization objective.
2. **Unified Perspective**: Provides a unified framework for understanding and comparing MMCL and SSCL, explaining why MMCL performs better on downstream tasks.
3. **Empirical Verification**: Proposes using the multi-modal information in a pre-trained model (such as CLIP) to guide self-supervised learning (such as SimCLR), and reports significant performance improvements on ImageNet.

### Theoretical Framework:

- **Mathematical Modeling**: Reformulates the MMCL objective as an asymmetric matrix factorization problem, revealing that MMCL in essence learns a low-rank decomposition of the joint distribution.
- **Ideal Representation**: Derives the ideal form of the MMCL representation through singular value decomposition (SVD) and provides a generalization guarantee for linear probing.
- **Generalization Bound**: Identifies two key factors, label error and singular values, that determine how well multi-modal pre-training transfers to downstream tasks.

### Experimental Verification:

- **Positive Pair Comparison**: Compares the positive pairs generated by CLIP and SimCLR, showing that text-induced positive pairs are more semantically consistent and more diverse.
- **Performance Improvement**: Proposes four techniques that use multi-modal information to guide self-supervised learning, significantly improving the performance of SimCLR on ImageNet.

In conclusion, through theoretical analysis and experiments, the paper provides the first generalization guarantee for MMCL, explains why MMCL outperforms SSCL on downstream tasks, and proposes concrete methods for improving SSCL.
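
The matrix factorization view in the theoretical framework can be illustrated with a small NumPy sketch. It is not the paper's code: the function names `normalized_cooccurrence` and `ideal_factors`, the random toy joint distribution, and the choice to normalize the image-text co-occurrence matrix by its marginals (as is common in spectral analyses of contrastive learning) are all assumptions made for illustration. The sketch shows that a truncated SVD yields rank-k image and text factors whose product is the best rank-k approximation of the normalized matrix.

```python
import numpy as np

def normalized_cooccurrence(P):
    """Scale a joint image-text distribution P (images x texts) by its marginals.

    Assumption: entry (i, j) becomes P[i, j] / sqrt(p_v[i] * p_l[j]).
    """
    p_v = P.sum(axis=1)  # marginal over images
    p_l = P.sum(axis=0)  # marginal over texts
    return P / np.sqrt(np.outer(p_v, p_l))

def ideal_factors(P, k):
    """Return rank-k factors F_V, F_L with F_V @ F_L.T close to the normalized matrix.

    By the Eckart-Young theorem, the truncated SVD gives the best rank-k
    approximation in Frobenius norm. The singular values are split evenly
    between the two factors, which is one of many equivalent choices.
    """
    P_tilde = normalized_cooccurrence(P)
    U, s, Vt = np.linalg.svd(P_tilde, full_matrices=False)
    root = np.sqrt(s[:k])
    F_V = U[:, :k] * root      # hypothetical "image" features, one row per image
    F_L = Vt[:k].T * root      # hypothetical "text" features, one row per text
    return P_tilde, F_V, F_L, s

# Toy joint distribution over 6 images and 4 captions (random, for illustration only).
rng = np.random.default_rng(0)
P = rng.random((6, 4))
P /= P.sum()

P_tilde, F_V, F_L, s = ideal_factors(P, k=2)

# The squared factorization error equals the energy in the discarded singular values.
residual = np.linalg.norm(P_tilde - F_V @ F_L.T) ** 2
discarded = np.sum(s[2:] ** 2)
print(residual, discarded)
```

In this toy setting the two factors play the roles of the image and text encoders' outputs. The residual depends only on the singular values that were truncated, which gives some intuition for why singular values appear as a factor in the generalization analysis.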