Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval

Jiacheng Cheng, Hijung Valentina Shin, Nuno Vasconcelos, Bryan Russell, Fabian Caba Heilbron

2024-05-06

Abstract: In recent years, dual-encoder vision-language models (e.g., CLIP) have achieved remarkable text-to-image retrieval performance. However, we discover that these models usually return very different retrievals for a pair of paraphrased queries. Such behavior might render the retrieval system less predictable and lead to user frustration. In this work, we consider the task of paraphrased text-to-image retrieval, where a model aims to return similar results given a pair of paraphrased queries. To start with, we collect a dataset of paraphrased image descriptions to facilitate quantitative evaluation for this task. We then hypothesize that the undesired behavior of existing dual-encoder models is due to their text towers, which are trained on image-sentence pairs and lack the ability to capture the semantic similarity between paraphrased queries. To improve on this, we investigate multiple strategies for training a dual-encoder model starting from a language model pretrained on a large text corpus. Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries while maintaining similar zero-shot classification and retrieval accuracy.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper addresses the inconsistent retrieval results that dual-encoder vision-language models (such as CLIP) produce for paraphrased queries. Specifically:

1. **Problem Description**:
   - Current dual-encoder vision-language models (e.g., CLIP) return significantly different retrieval results for queries that mean the same thing but differ slightly in wording. For example, "a child holding a box of pizza" and "a kid holding a box of pizza" have the same meaning, but CLIP returns drastically different results.
2. **Objective**:
   - Enable the model to return similar retrieval results for paraphrased queries with the same semantics, thereby improving user experience and system predictability.
3. **Solution**:
   - Collect a dataset of paraphrased image descriptions to facilitate quantitative evaluation.
   - Explore several strategies for training the dual-encoder model starting from a language model pretrained on a large text corpus. The goal is a significantly higher ranking similarity for paraphrased queries while maintaining zero-shot classification and retrieval accuracy.

Through these steps, the paper aims to improve the consistency and robustness of vision-language models when handling paraphrased queries.
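To make "ranking similarity for paraphrased queries" concrete, here is a minimal sketch of one way such consistency could be measured: the Jaccard overlap between the top-k images retrieved for two queries. It is an illustration under stated assumptions, not necessarily the metric used in the paper. The helper names (`cosine`, `top_k`, `top_k_jaccard`) and the two-dimensional toy embeddings are hypothetical; a real system would use embeddings from the model's text and image towers.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def top_k(query_emb, image_embs, k):
    """Indices of the k images most similar to the query, best first."""
    ranked = sorted(
        range(len(image_embs)),
        key=lambda i: cosine(query_emb, image_embs[i]),
        reverse=True,
    )
    return ranked[:k]

def top_k_jaccard(query_a, query_b, image_embs, k):
    """Jaccard overlap of the two top-k result sets (1.0 = identical sets)."""
    a = set(top_k(query_a, image_embs, k))
    b = set(top_k(query_b, image_embs, k))
    return len(a & b) / len(a | b)

# Toy data (made up for illustration): four image embeddings and three queries.
images = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
q_child = [1.0, 0.05]   # stands in for "a child holding a box of pizza"
q_kid = [0.95, 0.1]     # stands in for "a kid holding a box of pizza"
q_other = [0.0, 1.0]    # an unrelated query

print(top_k_jaccard(q_child, q_kid, images, k=2))
print(top_k_jaccard(q_child, q_other, images, k=2))
```

In this toy setup the two paraphrases are embedded close together, so their top-2 sets coincide, while the unrelated query shares no results with them. A set-based overlap ignores the order within the top-k; a rank-weighted measure would be stricter.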

