Abstract: Cross-modal contrastive learning in vision language pretraining (VLP) faces the challenge of (partial) false negatives. In this paper, we study this problem from the perspective of Mutual Information (MI) optimization. It is well known that the InfoNCE loss used in contrastive learning maximizes a lower bound on the MI between anchors and their positives, while we theoretically prove that the MI involving negatives also matters when noise is common. Guided by a more general form of the lower bound, we propose a contrastive learning strategy regulated by progressively refined cross-modal similarity, which more accurately optimizes the MI between an image/text anchor and its negative texts/images instead of improperly minimizing it. Our method performs competitively on four downstream cross-modal tasks and systematically balances the beneficial and harmful effects of (partial) false negative samples under theoretical guidance.
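
For reference, the standard InfoNCE bound that the abstract alludes to can be written as below. The notation is an assumption made here for illustration and is not taken from the paper: $x$ is an anchor, $y^{+}$ its positive, $\{y_j\}_{j=1}^{N}$ the candidate set containing $y^{+}$ and $N-1$ negatives, $s(\cdot,\cdot)$ a similarity score, and $\tau$ a temperature.

```latex
\mathcal{L}_{\mathrm{InfoNCE}}
  = -\,\mathbb{E}\left[
      \log \frac{\exp\!\big(s(x, y^{+})/\tau\big)}
                {\sum_{j=1}^{N} \exp\!\big(s(x, y_j)/\tau\big)}
    \right],
\qquad
I(x;\, y^{+}) \;\ge\; \log N - \mathcal{L}_{\mathrm{InfoNCE}}
```

Minimizing the loss therefore raises a lower bound on the MI between anchors and positives. The bound treats every $y_j \neq y^{+}$ as unrelated to $x$, which is the assumption that (partial) false negatives violate.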

What problem does this paper attempt to address?

### The problems the paper attempts to solve

This paper addresses the problem of partial false negatives in cross-modal contrastive learning for Vision Language Pretraining (VLP). Specifically, it re-examines the problem from the perspective of Mutual Information (MI) optimization and proposes a new contrastive learning strategy that uses progressively refined cross-modal similarity to optimize more accurately the mutual information between an image or text anchor and its negative samples.

### Problem background

In vision-language pre-training, Self-supervised Multi-modal Contrastive Learning (SMCL) has made significant progress through image-to-text and text-to-image contrastive learning. However, this strategy runs into trouble with the one-to-many correspondence between images and texts: some negative samples are in fact semantically consistent, or partially consistent, with the anchor. These partial false negatives impede the contrastive objective and lead to sub-optimal cross-modal representations.

### Solutions

1. **Theoretical analysis**:
   - The authors re-examine the InfoNCE loss function from the perspective of mutual information optimization. They prove that, in the presence of non-negligible partial false negatives, optimizing InfoNCE is equivalent to maximizing a lower bound on the mutual information difference MI-P − MI-N, that is, the MI between anchors and positives minus the MI between anchors and negatives. This shows that the traditional contrastive strategy may over-minimize the mutual information between partial false negatives and anchors, which degrades the structure of the representation space.
2. **New contrastive learning strategy**:
   - Based on this analysis, the authors propose Similarity-Regulated Contrastive Learning (SRCL). Specifically, they introduce a contrastive weight, based on cross-modal similarity and progressively refined during training, that adjusts the contrastive effect of each negative sample. This regulator guides the model to optimize MI-N appropriately rather than minimizing it indiscriminately, and so produces a more semantically structured representation space.

### Experimental results

1. **Downstream task performance**:
   - The authors evaluate SRCL on multiple downstream tasks, including Visual Question Answering (VQA), Cross-modal Retrieval, Zero-shot Cross-modal Retrieval, and Natural Language for Visual Reasoning (NLVR). The experimental results show that SRCL significantly improves performance on these tasks.
2. **Quantitative analysis**:
   - The authors verify the effectiveness of the method by removing different proportions of false negative samples (or hard negative samples). The results show that moderately removing false negative samples improves downstream performance, whereas excessive removal degrades it. This supports the claim that SRCL systematically balances the beneficial and harmful effects of false negative samples through cross-modal similarity regulation.
3. **Qualitative analysis**:
   - The authors also visualize zero-shot text-image retrieval results. SRCL captures the potential similarities between text descriptions and images more comprehensively, and the ranking of retrieved results moves from full matches to partial matches, which further supports the method.

### Summary

Through theoretical analysis and experiments, this paper demonstrates the adverse effects of partial false negatives in cross-modal contrastive learning and proposes a new contrastive learning strategy, SRCL. By optimizing mutual information through cross-modal similarity regulation, it significantly improves the performance of vision-language pre-training models on multiple downstream tasks.
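
The idea of down-weighting likely false negatives can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the function names, the `sharpness` parameter, and the specific weighting formula `1 - exp(-sharpness * gap)` are hypothetical choices made here, since the summary does not give SRCL's exact regulator.

```python
import numpy as np


def info_nce(sim, temperature=0.07, neg_weights=None):
    """Image-to-text InfoNCE over a batch similarity matrix.

    sim[i, j] is the similarity between image i and text j; the diagonal
    holds the positive pairs. neg_weights[i, j] in [0, 1] scales how
    strongly negative j contributes for anchor i (all ones = plain InfoNCE).
    """
    logits = sim / temperature
    if neg_weights is None:
        weights = np.ones_like(logits)
    else:
        weights = np.asarray(neg_weights, dtype=float).copy()
    np.fill_diagonal(weights, 1.0)  # positives always keep full weight

    # Numerically stable weighted log-sum-exp over each row.
    row_max = logits.max(axis=1, keepdims=True)
    weighted_exp = weights * np.exp(logits - row_max)
    log_denominator = np.log(weighted_exp.sum(axis=1)) + row_max[:, 0]
    return float(np.mean(log_denominator - np.diag(logits)))


def similarity_regulated_weights(reference_sim, sharpness=5.0):
    """Hypothetical regulator (an assumption, not the paper's formula):
    a negative whose similarity is close to the positive pair's similarity
    is treated as a likely false negative and gets a weight near 0."""
    gap = np.diag(reference_sim)[:, None] - reference_sim
    weights = 1.0 - np.exp(-sharpness * np.clip(gap, 0.0, None))
    np.fill_diagonal(weights, 1.0)
    return weights


# Toy batch: text 1 is a near-duplicate caption for image 0.
sim = np.array([
    [0.90, 0.85, 0.10],
    [0.80, 0.90, 0.20],
    [0.10, 0.20, 0.90],
])
weights = similarity_regulated_weights(sim)
plain_loss = info_nce(sim)
regulated_loss = info_nce(sim, neg_weights=weights)
print(plain_loss, regulated_loss)
```

In this sketch the weights are computed from the same similarity matrix that is being trained. The "progressively refined" aspect described above would correspond to recomputing them from the model's current similarities as training proceeds, so that the regulator sharpens along with the representations.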

Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation
