Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features

Alberto Baldrati, Marco Bertini, Tiberio Uricchio, Alberto del Bimbo
2023-08-22
Abstract: Given a query composed of a reference image and a relative caption, the Composed Image Retrieval goal is to retrieve images visually similar to the reference one that integrates the modifications expressed by the caption. Given that recent research has demonstrated the efficacy of large-scale vision and language pre-trained (VLP) models in various tasks, we rely on features from the OpenAI CLIP model to tackle the considered task. We initially perform a task-oriented fine-tuning of both CLIP encoders using the element-wise sum of visual and textual features. Then, in the second stage, we train a Combiner network that learns to combine the image-text features integrating the bimodal information and providing combined features used to perform the retrieval. We use contrastive learning in both stages of training. Starting from the bare CLIP features as a baseline, experimental results show that the task-oriented fine-tuning and the carefully crafted Combiner network are highly effective and outperform more complex state-of-the-art approaches on FashionIQ and CIRR, two popular and challenging datasets for composed image retrieval. Code and pre-trained models are available at https://github.com/ABaldrati/CLIP4Cir
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses composed image retrieval: given a query made of a reference image and a relative caption, retrieve images that are visually similar to the reference image and that reflect the modifications described in the caption. To do this, the authors combine contrastive learning with task-oriented CLIP features. The main contributions of the paper are:

1. **Propose a new task-oriented fine-tuning scheme**: This scheme adapts large-scale pre-trained vision-language models to the composed image retrieval task, with the aim of reducing the mismatch between large-scale pre-training and the downstream task.
2. **Propose a two-stage method**: The first stage performs task-oriented fine-tuning of the CLIP encoders to improve the additive properties of the embedding space. The second stage trains a Combiner network from scratch on top of the task-oriented features, so that it can perform a fine-grained fusion of image and text features.
3. **Handle images with non-square aspect ratios**: Because the CLIP visual encoder accepts only square inputs, the authors propose a new pre-processing pipeline that helps reduce the loss of image content in the retrieval task.
4. **Conduct multiple qualitative experiments**: These experiments show how the proposed method affects the feature distribution in the embedding space and how pairwise feature distances influence retrieval performance. They also use the Grad-CAM technique to visualize the image regions that matter most during retrieval.

Experiments on two standard and challenging datasets, FashionIQ and CIRR, show that the proposed two-stage method achieves state-of-the-art results.
This indicates that, with task-oriented fine-tuning and a carefully designed Combiner network, large-scale pre-trained vision-language models can be used effectively for composed image retrieval.
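
The first-stage idea described above (combine reference-image and caption features by element-wise sum, then train with a contrastive objective) can be illustrated with a minimal sketch. This is not the authors' implementation. The three-dimensional vectors stand in for CLIP features, the function names are hypothetical, and the temperature value is an arbitrary illustrative choice. Plain Python lists are used so the example runs without extra dependencies.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def combine_by_sum(image_feat, text_feat):
    """Element-wise sum of image and text features, then L2-normalized."""
    return l2_normalize([i + t for i, t in zip(image_feat, text_feat)])


def contrastive_loss(queries, targets, temperature=0.1):
    """Batch-based contrastive loss: query i should match target i and
    be pushed away from every other target in the batch.
    The temperature is an assumed value, not one taken from the paper."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, t) / temperature for t in targets]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


def retrieve(query, gallery):
    """Return gallery indices sorted from most to least similar."""
    return sorted(range(len(gallery)), key=lambda j: dot(query, gallery[j]), reverse=True)


# Toy stand-ins for encoder outputs (hypothetical values).
reference_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
caption_feats = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
targets = [l2_normalize([1.0, 0.0, 1.0]), l2_normalize([0.0, 1.0, 1.0])]

queries = [combine_by_sum(r, c) for r, c in zip(reference_feats, caption_feats)]

loss_aligned = contrastive_loss(queries, targets)
loss_swapped = contrastive_loss(queries, targets[::-1])
ranking = retrieve(queries[0], targets)
print(loss_aligned, loss_swapped, ranking)
```

In this toy setup the loss is lower when each query is paired with its own target than when the targets are swapped, which is the behavior the contrastive objective rewards. In the second stage described above, a learned Combiner network would take the place of `combine_by_sum`.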