Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval

Jiacheng Cheng, Hijung Valentina Shin, Nuno Vasconcelos, Bryan Russell, Fabian Caba Heilbron
2024-05-06
Abstract: In recent years, dual-encoder vision-language models (e.g., CLIP) have achieved remarkable text-to-image retrieval performance. However, we discover that these models usually return very different retrievals for a pair of paraphrased queries. Such behavior might render the retrieval system less predictable and lead to user frustration. In this work, we consider the task of paraphrased text-to-image retrieval, where a model aims to return similar results given a pair of paraphrased queries. To start with, we collect a dataset of paraphrased image descriptions to facilitate quantitative evaluation for this task. We then hypothesize that the undesired behavior of existing dual-encoder models is due to their text towers, which are trained on image-sentence pairs and lack the ability to capture the semantic similarity between paraphrased queries. To improve on this, we investigate multiple strategies for training a dual-encoder model starting from a language model pretrained on a large text corpus. Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries while maintaining similar zero-shot classification and retrieval accuracy.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper addresses the inconsistent retrieval results that dual-encoder vision-language models (such as CLIP) produce for paraphrased queries. Specifically:

1. **Problem Description**:
   - Current dual-encoder vision-language models (e.g., CLIP) return significantly different retrieval results for queries that have the same meaning but differ slightly in wording. For example, "a child holding a box of pizza" and "a kid holding a box of pizza" have the same meaning, but CLIP returns drastically different results.
2. **Objective**:
   - Propose a method that enables the model to return similar retrieval results for paraphrased queries with the same semantics, thereby improving user experience and system predictability.
3. **Solution**:
   - Collected a dataset of paraphrased image descriptions to facilitate quantitative evaluation.
   - Explored various training strategies for the dual-encoder model, starting from language models pre-trained on large-scale text corpora. The goal is to significantly increase the ranking similarity for paraphrased queries while maintaining zero-shot classification and retrieval accuracy.

Through these methods, the paper aims to improve the consistency and robustness of vision-language models when handling paraphrased queries.
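To make the notion of ranking similarity for paraphrased queries concrete, the following is a minimal sketch of one way to compare the retrievals for two queries. It is an illustration under stated assumptions, not the paper's evaluation protocol: the Jaccard overlap of the top-k results is an assumed stand-in metric, the embeddings are hand-written toy vectors rather than outputs of CLIP, and the function names (`cosine`, `top_k`, `top_k_overlap`) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query_emb, image_embs, k):
    """Indices of the k images most similar to the query, best first."""
    scores = [cosine(query_emb, img) for img in image_embs]
    ranked = sorted(range(len(image_embs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def top_k_overlap(query_a, query_b, image_embs, k):
    """Jaccard overlap of the two top-k result sets (assumed metric).

    1.0 means both queries retrieve the same k images;
    0.0 means the two result sets are disjoint.
    """
    a = set(top_k(query_a, image_embs, k))
    b = set(top_k(query_b, image_embs, k))
    return len(a & b) / len(a | b)


# Toy 2-D embeddings standing in for the image tower's outputs (assumption).
image_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

# Toy text embeddings: two queries that land close together, one far away.
query_child = [1.0, 0.0]
query_kid = [0.95, 0.05]
query_other = [0.0, 1.0]

consistent = top_k_overlap(query_child, query_kid, image_embs, k=2)
inconsistent = top_k_overlap(query_child, query_other, image_embs, k=2)
```

In this toy setup, a text tower that maps the two paraphrases to nearby embeddings yields a high overlap, while a text tower that maps them far apart yields a low one. A set-based overlap ignores the order within the top-k, so a rank-aware measure would be needed to capture finer differences.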