Abstract: This work addresses composed image retrieval in the context of domain conversion, where the content of a query image is retrieved in the domain specified by the query text. We show that a strong vision-language model provides sufficient descriptive power without additional training. The query image is mapped to the text input space using textual inversion. Unlike common practice, which inverts in the continuous space of text tokens, we use the discrete word space via a nearest-neighbor search in a text vocabulary. With this inversion, the image is softly mapped across the vocabulary and is made more robust using retrieval-based augmentation. Database images are retrieved by a weighted ensemble of text queries combining mapped words with the domain text. Our method outperforms prior art by a large margin on standard and newly introduced benchmarks. Code: https://github.com/NikosEfth/freedom

What problem does this paper attempt to address?

The paper addresses composed image retrieval (CIR) for **domain conversion**, a form of cross-domain image retrieval. Specifically, the authors propose FreeDom, a training-free CIR method: given a query image and a text naming the target domain, it retrieves images that share the query image's category and match the style or context of the target domain.

### Problem Background

Traditional composed image retrieval methods usually rely on supervised learning and need large amounts of labeled data for training. Labeled data is expensive to acquire, which limits where these methods can be applied. In addition, most existing methods are restricted to specific application scenarios, such as fashion or physical state. The recent emergence of vision-language models (VLMs) has opened new approaches to composed image retrieval, but most methods still require additional training or rely on large language models (LLMs).

### Main Contributions of the Paper

1. **Focus on Domain Conversion Tasks**: This is the first composed image retrieval study dedicated to domain conversion. It also introduces three new benchmark datasets and extends an existing dataset to cover more source domains.
2. **Propose the FreeDom Method**: A training-free composed image retrieval method that achieves open-world domain conversion using a frozen CLIP model.
3. **Improve Textual Inversion**: Textual inversion in the discrete vocabulary space is shown to be more effective than inversion in the continuous latent space of pseudo-words.
4. **Significantly Outperform Existing Methods**: FreeDom surpasses all existing methods by a large margin on four benchmark datasets.
5. **Provide a Basis for Future Research**: The experimental results provide a testbed for future composed image retrieval research.

### Method Overview

The core idea of FreeDom is to map the query image to the text space through textual inversion and then combine it with the target-domain text for retrieval. The steps are as follows:

1. **Query Image Embedding**: Embed the query image into the embedding space of a pre-trained vision-language model.
2. **Textual Inversion**: Map the query image embedding back to the discrete text space through a nearest-neighbor search over a text vocabulary, obtaining the words most similar to the query image.
3. **Composed Query**: Combine these words with the target-domain text into new query texts, and embed the composed queries with the text encoder.
4. **Retrieval**: Retrieve the most similar images from the database according to the embeddings of the composed queries.

In this way, FreeDom performs cross-domain image retrieval without any additional training.

### Experimental Results

The experiments show that FreeDom significantly outperforms existing composed image retrieval methods on multiple benchmark datasets. In particular, on the ImageNet-R, MiniDomainNet, NICO++ and LTLL datasets, FreeDom achieves mAP improvements of 15.87%, 14.32%, 6.44% and 6.64%, respectively.

### Summary

This paper tackles domain conversion in composed image retrieval by proposing the FreeDom method, demonstrating the potential of training-free composed image retrieval. The results provide a new direction and a basis for future composed image retrieval research.
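
The four steps in the method overview can be illustrated with a small, self-contained sketch. It is not the authors' implementation: `toy_embed` is a hypothetical stand-in for a frozen vision-language encoder such as CLIP (here a shared one-hot space for images and text), and the vocabulary, the softmax temperature, the value of `k`, and the way words and domain are composed are all assumptions made for illustration. The retrieval-based augmentation mentioned in the abstract is omitted.

```python
import numpy as np

# Toy vocabulary and domains (assumed for illustration only).
CONCEPTS = ["dog", "cat", "car"]
DOMAINS = ["photo", "sketch", "painting"]
DIM = len(CONCEPTS) + len(DOMAINS)


def toy_embed(concept=None, domain=None):
    """Hypothetical stand-in for a frozen image/text encoder.

    Concepts and domains occupy orthogonal axes of one shared space,
    so an image and its matching text get the same unit vector.
    """
    v = np.zeros(DIM)
    if concept is not None:
        v[CONCEPTS.index(concept)] = 1.0
    if domain is not None:
        v[len(CONCEPTS) + DOMAINS.index(domain)] = 1.0
    return v / np.linalg.norm(v)


def invert_to_words(image_emb, vocab, k=2, temperature=0.1):
    """Step 2: nearest-neighbor search in the discrete word space."""
    vocab_embs = np.stack([toy_embed(concept=w) for w in vocab])
    sims = vocab_embs @ image_emb            # cosine similarity (unit vectors)
    top = np.argsort(-sims)[:k]              # indices of the k closest words
    logits = sims[top] / temperature
    weights = np.exp(logits - logits.max())  # soft weights over the k words
    weights /= weights.sum()
    return [vocab[i] for i in top], weights


def retrieve(image_emb, target_domain, db_embs, vocab, k=2):
    """Steps 3-4: compose one text query per word, then rank the database."""
    words, weights = invert_to_words(image_emb, vocab, k=k)
    query_embs = np.stack([toy_embed(w, target_domain) for w in words])
    scores = weights @ (query_embs @ db_embs.T)  # weighted ensemble of queries
    return np.argsort(-scores), scores


# Database: every concept rendered in every domain.
db_items = [(c, d) for c in CONCEPTS for d in DOMAINS]
db_embs = np.stack([toy_embed(c, d) for c, d in db_items])

# Step 1: embed the query image (a dog photo), then ask for the sketch domain.
query_image = toy_embed("dog", "photo")
ranking, scores = retrieve(query_image, "sketch", db_embs, CONCEPTS)
best = db_items[ranking[0]]
print(best)  # ('dog', 'sketch')
```

With a real model, `toy_embed` would be replaced by the image and text encoders of the frozen vision-language model, and the vocabulary would be a large word list rather than three class names.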

Composed Image Retrieval for Training-Free Domain Conversion
