Nearest Neighbor Normalization Improves Multimodal Retrieval

Neil Chowdhury, Franklin Wang, Sumedh Shenoy, Douwe Kiela, Sarah Schwettmann, Tristan Thrush
2024-11-01
Abstract: Multimodal models leverage large-scale pre-training to achieve strong but still imperfect performance on tasks such as image captioning, visual question answering, and cross-modal retrieval. In this paper, we present a simple and efficient method for correcting errors in trained contrastive image-text retrieval models with no additional training, called Nearest Neighbor Normalization (NNN). We show an improvement on retrieval metrics in both text retrieval and image retrieval for all of the contrastive models that we tested (CLIP, BLIP, ALBEF, SigLIP, BEiT) and for both of the datasets that we used (MS-COCO and Flickr30k). NNN requires a reference database, but does not require any training on this database, and can even increase the retrieval accuracy of a model after finetuning.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper asks how to improve the retrieval performance of contrastive image-text models without any additional training. The authors propose a method called "Nearest Neighbor Normalization (NNN)" that corrects errors in already-trained contrastive models and improves both text retrieval and image retrieval.

### Problem Background

Contrastive models (such as CLIP and BLIP) achieve strong but imperfect performance on tasks such as image captioning, visual question answering, and cross-modal retrieval through large-scale pre-training. These models use a contrastive loss (such as InfoNCE) to learn joint text and image embeddings, pulling the embeddings of matching text-image pairs together and pushing those of non-matching pairs apart. However, this training optimizes the pre-training objective rather than the accuracy of downstream retrieval, so the learned embeddings can be suboptimal for retrieval tasks.

### Proposed Solution

To address this, the authors propose NNN, which has the following main features:

1. **No Additional Training Required**: NNN requires a reference database but no training on it; it is applied only at inference time.
2. **Efficiency**: The time overhead of NNN is sublinear in the size of the reference database, so it adds only a small computational cost.
3. **Reducing the Hubness Problem**: Hubness means that some images or texts appear in the results of many queries, which leads to incorrect matches. NNN reduces it by using the nearest-neighbor query embeddings in a reference query database to correct the bias of each retrieval candidate.
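To make the hubness problem concrete, the sketch below counts how often each candidate is ranked first across a set of queries. The score matrix is made-up illustrative data and `top1_counts` is a hypothetical helper, not something taken from the paper.

```python
from collections import Counter


def top1_counts(score_matrix):
    """Count how often each candidate is ranked first.

    score_matrix[i][j] holds s(q_i, r_j), the similarity between
    query i and candidate j.
    """
    counts = Counter()
    for row in score_matrix:
        # Index of the highest-scoring candidate for this query.
        best = max(range(len(row)), key=row.__getitem__)
        counts[best] += 1
    return counts


# Hypothetical scores for 4 queries (rows) and 3 candidates (columns).
# Candidate 2 scores fairly high against every query, so it acts as a hub.
scores = [
    [0.30, 0.10, 0.32],
    [0.12, 0.31, 0.33],
    [0.10, 0.15, 0.35],
    [0.29, 0.11, 0.28],
]
counts = top1_counts(scores)
print(dict(counts))
```

In this toy data, candidate 2 is returned first for three of the four queries, even though two of those queries score nearly as high against another candidate. A per-candidate correction is meant to remove that kind of skew.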
### Method Principle

The core idea of NNN is to normalize the score of each retrieval candidate, correcting candidates whose embeddings receive systematically too-high or too-low retrieval scores. Specifically, for each retrieval candidate \( r \), NNN computes a bias \( b(r) \) from the average score of the \( k \) queries in the reference query database \( D \) that are most similar to \( r \):

\[ b(r)=\alpha\cdot\frac{1}{k}\sum_{q_j\in D_{topk}(r)}s(q_j,r) \]

where \( D_{topk}(r)=\operatorname{arg\,max}^{k}_{q\in D}\, s(q,r) \) is the set of the \( k \) queries in \( D \) with the highest scores against \( r \), \( s(q,r) \) is the matching score between the query \( q \) and the retrieval candidate \( r \) (usually cosine similarity), and \( \alpha \) is a constant coefficient. The final de-biased retrieval score \( s_D(q,r) \) is obtained by subtracting the estimated bias from the original score:

\[ s_D(q,r)=s(q,r)-b(r) \]

### Experimental Results

The experiments show that NNN improves retrieval metrics for both text retrieval and image retrieval on all of the contrastive models tested (CLIP, BLIP, ALBEF, SigLIP, BEiT) and on both datasets (MS-COCO and Flickr30k). It also reduces the hubness problem and gender bias.

### Summary

NNN is a simple and efficient method that improves the performance of contrastive models on multimodal retrieval without additional training.
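The two formulas under Method Principle can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the helper names, the toy scores, and the values `K = 2` and `ALPHA = 0.5` are all assumptions chosen for the example, and a practical version would use vectorized or approximate nearest-neighbor search instead of sorting every score list.

```python
def nnn_bias(ref_scores, k, alpha):
    """b(r) = alpha * mean of the k highest s(q_j, r) over reference queries.

    ref_scores lists s(q_j, r) for every reference query q_j in D and one
    fixed candidate r. Assumes ref_scores is non-empty.
    """
    top_k = sorted(ref_scores, reverse=True)[:k]
    return alpha * sum(top_k) / len(top_k)


def nnn_debias(query_scores, biases):
    """s_D(q, r) = s(q, r) - b(r), applied to every candidate r."""
    return [s - b for s, b in zip(query_scores, biases)]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical reference scores: one list per candidate, one entry per
# reference query. Candidate 2 scores high against all of them (a hub).
ref_scores = [
    [0.30, 0.29, 0.12, 0.10],  # candidate 0
    [0.31, 0.15, 0.11, 0.10],  # candidate 1
    [0.35, 0.33, 0.32, 0.28],  # candidate 2
]
K, ALPHA = 2, 0.5  # assumed values for illustration only

# The biases depend only on the candidates and the reference queries,
# so they can be computed once and reused for every incoming query.
biases = [nnn_bias(col, K, ALPHA) for col in ref_scores]

# Raw scores s(q, r) of one new query against the three candidates.
query_scores = [0.30, 0.10, 0.32]
raw_best = argmax(query_scores)
debiased = nnn_debias(query_scores, biases)
nnn_best = argmax(debiased)
print(raw_best, nnn_best)
```

With these toy numbers the raw scores rank the hub (candidate 2) first, while the de-biased scores rank candidate 0 first, because the hub carries the largest bias term.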