Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation

Chaoya Jiang, Wei Ye, Haiyang Xu, Miang Yan, Shikun Zhang, Jie Zhang, Fei Huang
2023-06-22
Abstract: Cross-modal contrastive learning in vision language pretraining (VLP) faces the challenge of (partial) false negatives. In this paper, we study this problem from the perspective of Mutual Information (MI) optimization. It is well known that the InfoNCE loss used in contrastive learning maximizes a lower bound of the MI between anchors and their positives; we theoretically prove that the MI involving negatives also matters when noise is common. Guided by a more general form of the lower bound, we propose a contrastive learning strategy regulated by progressively refined cross-modal similarity, which more accurately optimizes the MI between an image/text anchor and its negative texts/images instead of improperly minimizing it. Our method performs competitively on four downstream cross-modal tasks and systematically balances the beneficial and harmful effects of (partial) false negative samples under theoretical guidance.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
### The problems the paper attempts to solve

This paper addresses the problem of (partial) false negatives in cross-modal contrastive learning for Vision Language Pre-training (VLP). Specifically, it re-examines the problem from the perspective of Mutual Information (MI) optimization and proposes a contrastive learning strategy that uses progressively refined cross-modal similarity to more accurately optimize the MI between an image or text anchor and its negative samples.

### Problem background

In vision-language pre-training, Self-supervised Multi-modal Contrastive Learning (SMCL) has made significant progress through image-to-text and text-to-image contrastive learning. However, this strategy runs into trouble when images and texts have one-to-many correspondences: some negative samples are in fact semantically consistent, or partially consistent, with the anchor. These (partial) false negatives impede the contrast effect and lead to sub-optimal cross-modal representations.

### Solutions

1. **Theoretical analysis**:
   - The authors re-examine the InfoNCE loss from the perspective of MI optimization. They prove that, when partial false negatives are non-negligible, optimizing InfoNCE is equivalent to maximizing a lower bound on the difference between the MI of anchors with their positives (MI-P) and the MI of anchors with their negatives (MI-N), that is, MI-P - MI-N. This shows that the traditional contrastive strategy may over-minimize the MI between partial false negatives and their anchors, which harms the structure of the representation space.
2. **New contrastive learning strategy**:
   - Based on this analysis, the authors propose Similarity-Regulated Contrastive Learning (SRCL). It introduces a contrastive weight for each negative sample, derived from cross-modal similarity and progressively refined during training, that regulates how strongly the negative is contrasted against the anchor. This regulator guides the model to optimize MI-N appropriately rather than minimizing it indiscriminately, which yields a more semantically structured representation space.

### Experimental results

1. **Downstream task performance**:
   - The authors evaluate SRCL on four downstream tasks: Visual Question Answering (VQA), cross-modal retrieval, zero-shot cross-modal retrieval, and Natural Language for Visual Reasoning (NLVR). The reported results show that SRCL improves performance on these tasks.
2. **Quantitative analysis**:
   - The authors test the method by removing different proportions of false negative samples (or hard negative samples). Removing a moderate proportion improves downstream performance, but removing too many degrades it. This supports the claim that SRCL systematically balances the beneficial and harmful effects of false negatives through cross-modal similarity regulation.
3. **Qualitative analysis**:
   - The authors also visualize zero-shot text-to-image retrieval results. SRCL captures the potential similarities between text descriptions and images more comprehensively, and the ranking of retrieved results moves from full matches to partial matches, which further supports the effectiveness of the method.

### Summary

This paper demonstrates, through theoretical analysis and experiments, the adverse effects of partial false negatives in cross-modal contrastive learning, and proposes a new contrastive learning strategy, SRCL. By optimizing mutual information through cross-modal similarity regulation, SRCL improves the performance of vision-language pre-training models on multiple downstream tasks.
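
To make the idea of a per-negative contrastive weight concrete, the following is a minimal NumPy sketch of an InfoNCE loss whose negatives are down-weighted according to a similarity estimate. It is an illustration under stated assumptions, not the authors' implementation: the function names, the weighting rule `1 - similarity`, and the temperature value are all hypothetical, and the summary above does not specify how the paper computes or refines its weights.

```python
import numpy as np


def info_nce(sim, temperature=0.07):
    """Standard image-to-text InfoNCE over a batch.

    sim[i, j] is the similarity between image i and text j;
    the matched (positive) pairs sit on the diagonal.
    """
    logits = sim / temperature
    # Subtract the row maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))


def weights_from_similarity(ref_sim):
    """Hypothetical weighting rule (an assumption, not from the paper).

    Negatives that a reference similarity estimate considers close to the
    anchor receive a smaller weight, so they are pushed away less strongly.
    """
    w = 1.0 - np.clip(ref_sim, 0.0, 1.0)
    np.fill_diagonal(w, 1.0)  # positives always keep full weight
    return w


def regulated_info_nce(sim, weights, temperature=0.07):
    """InfoNCE with each denominator term scaled by weights[i, j]."""
    logits = sim / temperature
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    w = weights.copy()
    np.fill_diagonal(w, 1.0)
    denom = (w * exp).sum(axis=1)
    return -np.mean(np.log(np.diag(exp) / denom))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=(4, 8))
    txt = rng.normal(size=(4, 8))
    # L2-normalize so the dot product is a cosine similarity.
    img /= np.linalg.norm(img, axis=1, keepdims=True)
    txt /= np.linalg.norm(txt, axis=1, keepdims=True)
    sim = img @ txt.T
    print("standard :", info_nce(sim))
    print("regulated:", regulated_info_nce(sim, weights_from_similarity(sim)))
```

With all weights equal to 1 the regulated loss reduces to standard InfoNCE. In this sketch the weights are computed from the same similarity matrix that is being trained; a "progressively refined" variant would presumably recompute them from an improving similarity estimate as training proceeds, which is left out here.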