Abstract: Most existing image-text matching methods adopt triplet loss as the optimization objective, and choosing a proper negative sample for the triplet <anchor, positive, negative> is important for training the model effectively; hard negatives, for example, make learning more efficient. However, we observe that existing methods mainly take the samples most similar to the anchor as hard negatives, and these may not be true negatives. In other words, samples that have high similarity to the anchor but are not paired with it may still retain positive semantic associations; we call them false negatives. Repelling these false negatives in the triplet loss misleads semantic representation learning and results in inferior retrieval performance. In this paper, we propose a novel False Negative Elimination (FNE) strategy that selects negatives via sampling, which alleviates the problem introduced by false negatives. Specifically, we first construct the distributions of positive and negative samples separately from their similarities with the anchor, based on the features extracted by the image and text encoders. We then calculate the false negative probability of a given sample from its similarity with the anchor and the above distributions via Bayes' rule, and this probability is employed as the sampling weight during the negative sampling process. Since a small batch may not contain any false negatives, we design a memory module with momentum to retain a large negative buffer and apply our negative sampling strategy over the whole buffer. In addition, to make the model focus on hard negatives, we reassign the sampling weights of the simple negatives with a cut-down strategy. Extensive experiments on Flickr30K and MS-COCO demonstrate the superiority of the proposed false negative elimination strategy.
The code is available at https://github.com/LuminosityX/FNE.
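
As a rough illustration of the Bayes' rule step described in the abstract, the sketch below estimates a false negative probability from a similarity score and turns it into a sampling weight. It is not the authors' implementation. Several details are assumptions made only for illustration: the positive and negative similarity distributions are fitted as Gaussians, the prior `prior_pos` is a fixed constant, the sampling weight is taken as one minus the false negative probability, and the "cut-down" for simple negatives is modeled as scaling by a hypothetical `easy_scale` below a hypothetical `easy_threshold`. All function and parameter names are hypothetical.

```python
import math
import random
from statistics import mean, pstdev


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def false_negative_probability(sim, pos_sims, neg_sims, prior_pos=0.01):
    """P(positive | sim) via Bayes' rule.

    Assumption: both similarity distributions are Gaussian and the prior
    probability of a positive, prior_pos, is a fixed constant.
    """
    mu_p, sd_p = mean(pos_sims), pstdev(pos_sims) + 1e-8
    mu_n, sd_n = mean(neg_sims), pstdev(neg_sims) + 1e-8
    like_p = gaussian_pdf(sim, mu_p, sd_p) * prior_pos
    like_n = gaussian_pdf(sim, mu_n, sd_n) * (1.0 - prior_pos)
    # The tiny constant guards against division by zero if both terms underflow.
    return like_p / (like_p + like_n + 1e-300)


def sampling_weights(sims, pos_sims, neg_sims, easy_threshold=0.3, easy_scale=0.1):
    """Normalized sampling weights for candidate negatives.

    Assumption: weight = 1 - P(false negative), and candidates whose
    similarity is below easy_threshold are scaled down by easy_scale.
    """
    weights = []
    for s in sims:
        w = 1.0 - false_negative_probability(s, pos_sims, neg_sims)
        if s < easy_threshold:
            w *= easy_scale  # hypothetical stand-in for the cut-down strategy
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]


def sample_negative(candidates, weights, rng=random):
    """Draw one negative from the candidates according to the weights."""
    return rng.choices(candidates, weights=weights, k=1)[0]
```

Under these assumptions, a candidate whose similarity falls well inside the positive distribution receives a weight close to zero and is rarely drawn as a negative, while a moderately similar candidate (a likely hard true negative) receives most of the probability mass. In the setting the abstract describes, `sims` would range over the momentum-updated negative buffer rather than a single batch.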

MAFA: Managing False Negatives for Vision-Language Pre-training

GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language Pre-training

Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation

Accelerating Vision-Language Pretraining with Free Language Modeling

Language Model Pre-training on True Negatives

FFF: Fixing Flawed Foundations in contrastive pre-training results in very strong Vision-Language models

Your Negative May not Be True Negative: Boosting Image-Text Matching with False Negative Elimination

Noise-robust Vision-language Pre-training with Positive-negative Learning

MedFLIP: Medical Vision-and-Language Self-supervised Fast Pre-Training with Masked Autoencoder

VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment

Leveraging per Image-Token Consistency for Vision-Language Pre-training

Adapting Vision-Language Model with Fine-grained Semantics for Open-Vocabulary Segmentation

See, Say, and Segment: Teaching LMMs to Overcome False Premises

ViLReF: An Expert Knowledge Enabled Vision-Language Retinal Foundation Model

Learning from True-False Labels via Multi-modal Prompt Retrieving

CAVL: Learning Contrastive and Adaptive Representations of Vision and Language

Multi-task Paired Masking with Alignment Modeling for Medical Vision-Language Pre-training

M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization

IDEA: Increasing Text Diversity via Online Multi-Label Recognition for Vision-Language Pre-training

Vision-Language Pre-Training for Multimodal Aspect-Based Sentiment Analysis

Enhancing Vision-Language Few-Shot Adaptation with Negative Learning