Abstract: Composed Image Retrieval (CIR) is a challenging vision-language task, utilizing bi-modal (image+text) queries to retrieve target images. Despite the impressive performance of supervised CIR, the dependence on costly, manually labeled triplets limits its scalability and zero-shot capability. To address this issue, zero-shot composed image retrieval (ZS-CIR) is presented along with projection-based approaches. However, such methods face two major problems, i.e., task discrepancy between pre-training (image $\leftrightarrow$ text) and inference (image+text $\rightarrow$ image), and modality discrepancy. The latter pertains to approaches based on text-only projection training due to the necessity of feature extraction from the reference image during inference. In this paper, we propose a two-stage framework to tackle both discrepancies. First, to ensure efficiency and scalability, a textual inversion network is pre-trained on large-scale caption datasets. Subsequently, we put forward Modality-Task Dual Alignment (MoTaDual) as the second stage, where large language models (LLMs) generate triplet data for fine-tuning, and additionally, prompt learning is introduced in a multi-modal context to effectively alleviate both modality and task discrepancies. The experimental results show that our MoTaDual achieves state-of-the-art performance across four widely used ZS-CIR benchmarks, while maintaining low training time and computational cost. The code will be released soon.


### What problems does this paper attempt to solve?

This paper aims to solve two main problems in zero-shot composed image retrieval (ZS-CIR): **task discrepancy** and **modality discrepancy**.

1. **Task Discrepancy**:
   - During pre-training, the model only learns to map image or text features to specific tokens; it performs neither feature combination nor target retrieval. As a result, directly using template prompts (such as "a photo of <S*> that Tc") to retrieve target images at inference time performs poorly.
   - Vision-Language Pretraining (VLP) models (such as CLIP) are pre-trained through image-text matching, so at inference time they have difficulty adapting to the potential captions of target images.
2. **Modality Discrepancy**:
   - Existing projection-based methods (such as LinCIR) train on text only, for efficiency and scalability. At inference time, however, they must extract features from reference images, which results in a modality discrepancy.
   - Specifically, these methods see no image information during training but depend on image features during inference, so the model performs poorly when handling the conversion between images and text.

To solve these problems, the authors propose a two-stage framework:

1. **First stage**: Pre-train a textual inversion network on a large-scale caption dataset only, to ensure efficiency and scalability.
2. **Second stage**: Introduce **Modality-Task Dual Alignment (MoTaDual)**, which alleviates task and modality discrepancies through multi-modal prompt learning. Specifically, large language models (LLMs) generate new triplet data for fine-tuning, and prompt tuning is adopted to reduce computational cost.

With this method, MoTaDual achieves state-of-the-art performance on four widely used ZS-CIR benchmark datasets while maintaining low training time and computational cost.

### Formula Summary

- **Text Encoder Output**:
  \[ [W_{i+1}] = T_{i+1}(W_i) \quad \text{for} \quad i = 1, 2, \dots, L \]
- **Visual Encoder Output**:
  \[ [x_{i+1}, E_{i+1}] = I_{i+1}([x_i, E_i]) \quad \text{for} \quad i = 1, 2, \dots, K \]
- **Cross-Modal Projection Matching Loss (CMPM Loss)**:
  \[ L_{c2t} = KL(p \| q) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} p_{i,j} \log\left(\frac{p_{i,j}}{q_{i,j} + \epsilon}\right) \]
  where
  \[ q_{i,j} = \frac{y_{i,j}}{\sum_{k=1}^{N} y_{i,k}} \]
- **Overall Optimization Objective**:
  \[ L_{mtda} = L_{c2t} + L_{t2c} \]

These formulas describe how the text and image features are encoded and which loss functions are used to optimize the model.
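To make the loss formulas above concrete, here is a minimal pure-Python sketch of the CMPM term and the symmetric objective. The summary does not define $p_{i,j}$ or $y_{i,j}$, so the code assumes that $p_{i,j}$ is a row-wise softmax over cosine similarities within a batch and that $y_{i,j} = 1$ when samples $i$ and $j$ share a label. It also assumes "c" stands for the composed query and "t" for the target. The function names `cmpm_kl` and `mtda_loss` are hypothetical and are not taken from the authors' code.

```python
import math


def _normalise(vec):
    """Scale a vector to unit L2 norm so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cmpm_kl(src, dst, labels, eps=1e-8):
    """(1/N) * sum_i sum_j p_ij * log(p_ij / (q_ij + eps)).

    Assumptions (not stated in the summary):
      * p_ij is the softmax over j of the cosine similarity between src[i] and dst[j];
      * y_ij is 1 when labels[i] == labels[j], else 0.
    """
    src = [_normalise(v) for v in src]
    dst = [_normalise(v) for v in dst]
    n = len(src)
    total = 0.0
    for i in range(n):
        sims = [sum(a * b for a, b in zip(src[i], dst[j])) for j in range(n)]
        top = max(sims)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in sims]
        z = sum(exps)
        p = [e / z for e in exps]
        y = [1.0 if labels[i] == labels[j] else 0.0 for j in range(n)]
        y_sum = sum(y)  # at least 1, because j == i always matches
        q = [v / y_sum for v in y]  # q_ij = y_ij / sum_k y_ik
        total += sum(pj * math.log(pj / (qj + eps)) for pj, qj in zip(p, q))
    return total / n


def mtda_loss(composed, target, labels, eps=1e-8):
    """L_mtda = L_c2t + L_t2c, reading c as composed query and t as target (assumed)."""
    return cmpm_kl(composed, target, labels, eps) + cmpm_kl(target, composed, labels, eps)


# Toy batch of three query/target pairs with distinct labels (made-up numbers).
composed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.6]]
labels = [0, 1, 2]
loss = mtda_loss(composed, target, labels)
```

Because $q_{i,j}$ is zero for non-matching pairs, the $\epsilon$ in the denominator is what keeps the logarithm finite. In an actual training setup the same computation would be written with a tensor library so that gradients flow back to the learnable prompts.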

MoTaDual: Modality-Task Dual Alignment for Enhanced Zero-shot Composed Image Retrieval

Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training

Training-free Zero-shot Composed Image Retrieval via Weighted Modality Fusion and Similarity

Image2Sentence based Asymmetrical Zero-shot Composed Image Retrieval

Compositional Image Retrieval via Instruction-Aware Contrastive Learning

Target-Guided Composed Image Retrieval

Bi-directional Training for Composed Image Retrieval via Text Prompt Learning

Simple but Effective Raw-Data Level Multimodal Fusion for Composed Image Retrieval

Masking-Based Cross-Modal Remote Sensing Image–Text Retrieval via Dynamic Contrastive Learning

Pseudo-triplet Guided Few-shot Composed Image Retrieval

Zero-shot Composed Image Retrieval Considering Query-target Relationship Leveraging Masked Image-text Pairs

A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing

Vision-by-Language for Training-Free Compositional Image Retrieval

Self-Training Boosted Multi-Factor Matching Network for Composed Image Retrieval

MMoT: Mixture-of-Modality-Tokens Transformer for Composed Multimodal Conditional Image Synthesis

Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning

COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval

Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives

Language-only Efficient Training of Zero-shot Composed Image Retrieval

Enhanced Modality Transition for Image Captioning

Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval