Abstract: Composed image retrieval aims to retrieve an image of interest from a gallery using a composed query that consists of a reference image and its corresponding modified text. It has recently attracted attention because it combines information-rich images with concise language to express the requirements for the target image precisely. Most current composed image retrieval methods follow a supervised learning approach, training on a costly triplet dataset composed of a reference image, modified text, and a corresponding target image. To avoid difficult-to-obtain labeled triplet training data, zero-shot composed image retrieval (ZS-CIR) has been introduced, which aims to retrieve the target image by learning from image-text pairs (self-supervised triplets), without the need for human-labeled triplets. However, this self-supervised triplet learning approach is computationally less effective and less interpretable, as it assumes that the interaction between image and text is conducted through an implicit query embedding without explicit semantic interpretation. In this work, we present a new training-free zero-shot composed image retrieval method that translates the query into explicit, human-understandable text. This helps improve model learning efficiency and enhances the generalization capacity of foundation models. Further, we introduce a Local Concept Re-ranking (LCR) mechanism that focuses on discriminative local information extracted from the modified instructions. Extensive experiments on four ZS-CIR benchmarks show that our method achieves performance comparable to that of state-of-the-art triplet-training-based methods, while significantly outperforming other training-free methods on the open-domain datasets (CIRR, CIRCO and COCO) as well as the fashion-domain dataset (FashionIQ).
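The two-stage idea in the abstract (retrieve with an explicit text description, then re-rank using local concepts) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not specify how embeddings are obtained, how local concept scores are computed, or how they are combined with the global score. The function names `global_retrieve` and `local_rerank`, the weight `alpha`, the weighted-sum fusion, and the toy vectors are all hypothetical, and the vectors stand in for embeddings that a real system would obtain from a vision-language foundation model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def global_retrieve(query_vec, gallery, k):
    """Rank gallery images by similarity to the embedded target description.

    Returns the top-k (image_id, score) pairs, best first.
    """
    scored = [(img_id, cosine(query_vec, vec)) for img_id, vec in gallery.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def local_rerank(candidates, local_scores, alpha=0.5):
    """Re-order candidates by mixing global and local concept scores.

    ASSUMPTION: a simple weighted sum; the abstract does not state the
    actual fusion rule. local_scores maps image_id to a score in [0, 1].
    """
    mixed = [
        (img_id, (1 - alpha) * g + alpha * local_scores.get(img_id, 0.0))
        for img_id, g in candidates
    ]
    mixed.sort(key=lambda pair: pair[1], reverse=True)
    return mixed

# Toy stand-ins for image embeddings and for the embedded text description.
gallery = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.9, 0.1, 0.0],
    "img_c": [0.0, 1.0, 0.0],
}
query_vec = [1.0, 0.05, 0.0]

top2 = global_retrieve(query_vec, gallery, k=2)

# Hypothetical per-image scores for how well each candidate matches the
# local concept named in the modification text.
local_scores = {"img_a": 0.1, "img_b": 0.9}
reranked = local_rerank(top2, local_scores, alpha=0.5)
```

In this toy run the global stage ranks `img_a` slightly ahead of `img_b`, and the local score then moves `img_b` to the top, which is the kind of correction a re-ranking stage is meant to provide.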

Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval

Training-free Zero-shot Composed Image Retrieval with Local Concept Reranking

Vision-by-Language for Training-Free Compositional Image Retrieval

Imagine and Seek: Improving Composed Image Retrieval with an Imagined Proxy

Training-free Zero-shot Composed Image Retrieval via Weighted Modality Fusion and Similarity

Language-only Efficient Training of Zero-shot Composed Image Retrieval

Dr. CLIP: CLIP-Driven Universal Framework for Zero-Shot Sketch Image Retrieval

Zero-shot Composed Text-Image Retrieval

Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval

Zero-Shot Everything Sketch-Based Image Retrieval, and in Explainable Style

Context-I2W: Mapping Images to Context-dependent Words for Accurate Zero-Shot Composed Image Retrieval

Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval

Image2Sentence based Asymmetrical Zero-shot Composed Image Retrieval

iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval

Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval

Zero-shot Composed Image Retrieval Considering Query-target Relationship Leveraging Masked Image-text Pairs

CLIP for All Things Zero-Shot Sketch-Based Image Retrieval, Fine-Grained or Not

Zero-Shot Sketch-Based Image Retrieval via Graph Convolution Network

Pretrain like Your Inference: Masked Tuning Improves Zero-Shot Composed Image Retrieval

HyCIR: Boosting Zero-Shot Composed Image Retrieval with Synthetic Labels

Three-Stream Joint Network for Zero-Shot Sketch-Based Image Retrieval