Abstract: Multimodal models leverage large-scale pre-training to achieve strong but still imperfect performance on tasks such as image captioning, visual question answering, and cross-modal retrieval. In this paper, we present a simple and efficient method for correcting errors in trained contrastive image-text retrieval models with no additional training, called Nearest Neighbor Normalization (NNN). We show an improvement on retrieval metrics in both text retrieval and image retrieval for all of the contrastive models that we tested (CLIP, BLIP, ALBEF, SigLIP, BEiT) and for both of the datasets that we used (MS-COCO and Flickr30k). NNN requires a reference database, but does not require any training on this database, and can even increase the retrieval accuracy of a model after finetuning.

What problem does this paper attempt to address?

The paper asks how to improve the performance of contrastive image-text retrieval models without additional training. Specifically, the authors propose a method called "Nearest Neighbor Normalization (NNN)" that corrects errors in trained contrastive models and improves their performance on text retrieval and image retrieval.

### Problem Background

Contrastive models (such as CLIP and BLIP) achieve strong but imperfect performance on tasks such as image captioning, visual question answering, and cross-modal retrieval through large-scale pre-training. These models use contrastive loss functions (such as InfoNCE) to learn joint text and image embeddings, pulling the embeddings of matching text-image pairs closer together and pushing those of non-matching pairs farther apart. However, contrastive training optimizes the pre-training objective (such as InfoNCE), not the accuracy of downstream retrieval, so the learned embeddings may be suboptimal for retrieval tasks.

### Proposed Solution

To address this, the authors propose the NNN method, which has the following main features:

1. **No Additional Training Required**: NNN does not require any training on the reference database and only needs to be applied at inference time.
2. **Efficiency**: The time complexity of NNN is sub-linear with respect to the size of the reference database, so it adds only a small computational overhead.
3. **Reducing the Hubness Problem**: NNN reduces the hubness problem (some images or texts appear in the results of many queries, leading to incorrect matches) by using the nearest-neighbor query embeddings in the reference query database to correct the bias of each retrieval candidate.
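
The hubness problem described in the list above can be made concrete with a small diagnostic. The following is a minimal sketch, not taken from the paper: it assumes a precomputed score matrix where `scores[i, j]` holds \( s(q_i, r_j) \), and the function name `top1_hit_counts` is hypothetical. It counts how many queries rank each candidate first, so a candidate with an unusually large count behaves like a hub.

```python
import numpy as np


def top1_hit_counts(scores: np.ndarray) -> np.ndarray:
    """Count, for each candidate, how many queries rank it first.

    scores has shape (n_queries, n_candidates); scores[i, j] = s(q_i, r_j).
    This is an illustrative hubness diagnostic, not the paper's metric.
    """
    top1 = scores.argmax(axis=1)  # index of the best candidate per query
    return np.bincount(top1, minlength=scores.shape[1])
```

A heavily skewed count vector suggests that a few candidates dominate the results, which is the kind of per-candidate bias that NNN is designed to correct.
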
### Method Principle

The core idea of NNN is to normalize the score of each retrieval candidate, correcting for candidates that are assigned overly high or low retrieval scores. Specifically, for each retrieval candidate \( r \), NNN computes a bias \( b(r) \) from the average score of the \( k \) queries in the reference query database \( D \) that are most similar to \( r \):

\[ b(r)=\alpha\cdot\frac{1}{k}\sum_{q_j\in D_{topk}(r)}s(q_j,r) \]

where \( D_{topk}(r)=\arg\max_{q\in D}^{k} s(q,r) \) is the set of the \( k \) queries in \( D \) that are most similar to \( r \), \( s(q,r) \) is the matching score between the query \( q \) and the retrieval candidate \( r \) (usually cosine similarity), and \( \alpha \) is a constant coefficient. The final de-biased retrieval score \( s_D(q,r) \) is obtained by subtracting the estimated bias from the original score:

\[ s_D(q,r)=s(q,r)-b(r) \]

### Experimental Results

According to the abstract, NNN improves retrieval metrics for both text retrieval and image retrieval on every contrastive model tested (CLIP, BLIP, ALBEF, SigLIP, BEiT) and on both datasets used (MS-COCO and Flickr30k). The summary also reports reductions in hubness and in gender bias.

### Summary

NNN is a simple and efficient method that improves the performance of contrastive models on multimodal retrieval tasks without additional training.
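
The two formulas in the Method Principle section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `nnn_bias` and `nnn_scores` are hypothetical, all embeddings are assumed to be L2-normalized so that a dot product equals cosine similarity, and the top-\( k \) search is exhaustive. An exhaustive search is linear in the size of the reference database, so this sketch does not reproduce the sub-linear cost mentioned above, which would presumably need an approximate nearest-neighbor index.

```python
import numpy as np


def nnn_bias(ref_query_emb: np.ndarray, cand_emb: np.ndarray,
             k: int, alpha: float) -> np.ndarray:
    """Estimate b(r) for every retrieval candidate r.

    ref_query_emb: (n_ref, d) embeddings of the reference queries D.
    cand_emb:      (n_cand, d) embeddings of the retrieval candidates.
    Rows are assumed L2-normalized (an assumption of this sketch).
    """
    ref_scores = ref_query_emb @ cand_emb.T  # s(q_j, r), shape (n_ref, n_cand)
    k = min(k, ref_scores.shape[0])          # cannot use more queries than exist
    # For each candidate (column), keep the k largest reference scores.
    topk = np.partition(ref_scores, -k, axis=0)[-k:, :]
    return alpha * topk.mean(axis=0)         # b(r), shape (n_cand,)


def nnn_scores(query_emb: np.ndarray, cand_emb: np.ndarray,
               bias: np.ndarray) -> np.ndarray:
    """Return s_D(q, r) = s(q, r) - b(r) for every query/candidate pair."""
    return query_emb @ cand_emb.T - bias[None, :]
```

Because \( b(r) \) depends only on the candidates and the reference queries, the bias vector can be computed once and reused for every incoming query. At query time the only extra work is one subtraction per candidate.
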

Nearest Neighbor Normalization Improves Multimodal Retrieval
