Abstract: Given a query composed of a reference image and a relative caption, the goal of Composed Image Retrieval is to retrieve images that are visually similar to the reference one and that integrate the modifications expressed by the caption. Given that recent research has demonstrated the efficacy of large-scale vision and language pre-trained (VLP) models in various tasks, we rely on features from the OpenAI CLIP model to tackle the considered task. We initially perform a task-oriented fine-tuning of both CLIP encoders using the element-wise sum of visual and textual features. Then, in the second stage, we train a Combiner network that learns to combine the image-text features, integrating the bimodal information and providing combined features used to perform the retrieval. We use contrastive learning in both stages of training. Starting from the bare CLIP features as a baseline, experimental results show that the task-oriented fine-tuning and the carefully crafted Combiner network are highly effective and outperform more complex state-of-the-art approaches on FashionIQ and CIRR, two popular and challenging datasets for composed image retrieval. Code and pre-trained models are available at https://github.com/ABaldrati/CLIP4Cir
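
The first training stage described in the abstract (element-wise sum of image and text features, trained with a contrastive objective) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the L2 normalization step, and the temperature value are assumptions made for illustration, and plain Python lists stand in for CLIP feature tensors.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def sum_query(image_feat, text_feat):
    """Build a query by element-wise sum of reference-image and caption features.

    Normalizing the result is an assumption of this sketch.
    """
    return l2_normalize([a + b for a, b in zip(image_feat, text_feat)])


def contrastive_loss(queries, targets, temperature=0.01):
    """Batch-based contrastive loss (InfoNCE-style).

    The i-th query should be most similar to the i-th target; all other
    targets in the batch act as negatives. The temperature is a
    hypothetical value, not one taken from the text above.
    """
    total = 0.0
    for i, query in enumerate(queries):
        logits = [
            sum(a * b for a, b in zip(query, target)) / temperature
            for target in targets
        ]
        # Log-sum-exp computed stably by subtracting the maximum logit.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        total += log_denominator - logits[i]
    return total / len(queries)


# Toy batch of two (reference image, caption, target image) triplets.
targets = [l2_normalize([1.0, 0.0]), l2_normalize([0.0, 1.0])]
queries = [sum_query([1.0, 0.0], [0.5, 0.0]), sum_query([0.0, 1.0], [0.0, 0.5])]
aligned_loss = contrastive_loss(queries, targets)
swapped_loss = contrastive_loss(queries, list(reversed(targets)))
```

In this toy batch each query points at its own target, so the loss for the aligned pairing is far lower than the loss when the targets are swapped. Fine-tuning with such an objective pushes the summed features toward the features of the correct target image.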

What problem does this paper attempt to address?

This paper addresses composed image retrieval: given a query consisting of a reference image and a relative caption, retrieve images that are visually similar to the reference image and reflect the modifications the caption describes. To this end, the authors propose a method that combines contrastive learning with task-oriented CLIP features. The main contributions of the paper are:

1. **A new task-oriented fine-tuning scheme**: This scheme adapts large-scale pre-trained vision-language models to the composed image retrieval task, aiming to reduce the mismatch between large-scale pre-training and the downstream task.
2. **A two-stage method**: The first stage performs task-oriented fine-tuning of the CLIP encoders to improve the additive properties of the embedding space. The second stage trains a Combiner network from scratch on top of the task-oriented features, performing a fine-grained fusion of image and text features.
3. **A pre-processing pipeline for non-square images**: Because the CLIP visual encoder accepts only square inputs, the authors propose a new pre-processing pipeline that reduces the loss of image content in the retrieval task.
4. **Several qualitative experiments**: These experiments show how the proposed method affects the feature distribution in the embedding space and how pairwise feature distances influence retrieval performance. They also use the Grad-CAM technique to visualize the image regions that matter most during retrieval.

Experiments on two standard and challenging datasets, FashionIQ and CIRR, show that the proposed two-stage method achieves state-of-the-art results. This indicates that, with task-oriented fine-tuning and a carefully designed Combiner network, large-scale pre-trained vision-language models can be used effectively for composed image retrieval.
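
To make the pre-processing contribution more concrete, here is a hedged sketch of one way to limit content loss for non-square images: pad the shorter side before the usual resize and center crop, so that the crop removes less of the picture. The summary above does not specify the pipeline, so the padding strategy, the function name `padding_for_target_ratio`, and the default `target_ratio` are all assumptions made for illustration.

```python
import math


def padding_for_target_ratio(width, height, target_ratio=1.25):
    """Return (left, top, right, bottom) padding in pixels.

    After padding, longer_side / shorter_side is at most target_ratio.
    Both dimensions must be positive. The default target_ratio is a
    hypothetical value; target_ratio=1.0 pads to a full square.
    """
    longer, shorter = max(width, height), min(width, height)
    if longer / shorter <= target_ratio:
        return (0, 0, 0, 0)  # Already close enough to square.

    # Pixels to add to the shorter side, split across its two edges.
    needed = math.ceil(longer / target_ratio) - shorter
    first = needed // 2
    second = needed - first

    if width < height:
        return (first, 0, second, 0)  # Portrait image: pad left and right.
    return (0, first, 0, second)      # Landscape image: pad top and bottom.


portrait_padding = padding_for_target_ratio(400, 1000)
landscape_padding = padding_for_target_ratio(1000, 400)
square_padding = padding_for_target_ratio(300, 500, target_ratio=1.0)
```

The returned tuple uses the (left, top, right, bottom) order so that it can be handed to an image library's padding routine. Choosing a target ratio above 1.0 is a trade-off: less padding means less wasted input area, at the cost of cropping a small part of the image.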

Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features

Semantic Compositions Enhance Vision-Language Contrastive Learning

Compositional Image Retrieval via Instruction-Aware Contrastive Learning

ComCLIP: Training-Free Compositional Image and Text Matching

Target-Guided Composed Image Retrieval

Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives

TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives

Bi-directional Training for Composed Image Retrieval via Text Prompt Learning

Vision-by-Language for Training-Free Compositional Image Retrieval

Exploiting CLIP-based Multi-modal Approach for Artwork Classification and Retrieval

COLA: A Benchmark for Compositional Text-to-image Retrieval

CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning

Cross-Modal Attention Preservation with Self-Contrastive Learning for Composed Query-Based Image Retrieval

Enhancing Image Retrieval : A Comprehensive Study on Photo Search using the CLIP Mode

CaLa: Complementary Association Learning for Augmenting Composed Image Retrieval

Pseudo-triplet Guided Few-shot Composed Image Retrieval

Optimizing CLIP Models for Image Retrieval with Maintained Joint-Embedding Alignment

CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions

Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision

Self-Training Boosted Multi-Factor Matching Network for Composed Image Retrieval

Iclip: Bridging Image Classification and Contrastive Language-Image Pre-Training for Visual Recognition