MoTaDual: Modality-Task Dual Alignment for Enhanced Zero-shot Composed Image Retrieval

Haiwen Li, Fei Su, Zhicheng Zhao
2024-10-31
Abstract: Composed Image Retrieval (CIR) is a challenging vision-language task, utilizing bi-modal (image+text) queries to retrieve target images. Despite the impressive performance of supervised CIR, the dependence on costly, manually-labeled triplets limits its scalability and zero-shot capability. To address this issue, zero-shot composed image retrieval (ZS-CIR) is presented along with projection-based approaches. However, such methods face two major problems, i.e., task discrepancy between pre-training (image $\leftrightarrow$ text) and inference (image+text $\rightarrow$ image), and modality discrepancy. The latter pertains to approaches based on text-only projection training due to the necessity of feature extraction from the reference image during inference. In this paper, we propose a two-stage framework to tackle both discrepancies. First, to ensure efficiency and scalability, a textual inversion network is pre-trained on large-scale caption datasets. Subsequently, we put forward Modality-Task Dual Alignment (MoTaDual) as the second stage, where large language models (LLMs) generate triplet data for fine-tuning, and additionally, prompt learning is introduced in a multi-modal context to effectively alleviate both modality and task discrepancies. The experimental results show that our MoTaDual achieves state-of-the-art performance across four widely used ZS-CIR benchmarks, while maintaining low training time and computational cost. The code will be released soon.
Computer Vision and Pattern Recognition, Information Retrieval
What problem does this paper attempt to address?
This paper addresses two main problems in zero-shot composed image retrieval (ZS-CIR): **task discrepancy** and **modality discrepancy**.

1. **Task Discrepancy**:
   - During pre-training, the model usually only maps image or text features to specific tokens, without performing feature combination or target retrieval. As a result, directly using a template prompt (such as "a photo of <S*> that Tc") to retrieve target images at inference performs poorly.
   - Vision-Language Pretraining (VLP) models (such as CLIP) are pre-trained through image-text matching, so at inference they have difficulty adapting to the potential captions of target images.
2. **Modality Discrepancy**:
   - Existing projection-based methods (such as LinCIR) train on text only for efficiency and scalability. At inference, however, they must extract features from reference images, which results in a modality discrepancy.
   - Specifically, these methods do not use image information during training but rely on image features during inference, so the model performs poorly when handling the conversion between images and texts.

To solve these problems, the authors propose a two-stage framework:

1. **First stage**: Pre-train a textual inversion network on a large-scale caption dataset only, to ensure efficiency and scalability.
2. **Second stage**: Introduce **Modality-Task Dual Alignment (MoTaDual)**, which alleviates task and modality discrepancies through multi-modal prompt learning. Specifically, large language models (LLMs) generate new triplet data for fine-tuning, and prompt tuning is adopted to reduce computational complexity.
Through this method, MoTaDual achieves state-of-the-art performance on four widely used ZS-CIR benchmark datasets while maintaining low training time and computational cost.

### Formula Summary

- **Text Encoder Output**:
  \[ [W_{i+1}] = T_{i+1}(W_i) \quad \text{for} \quad i = 1, 2, \dots, L \]
- **Visual Encoder Output**:
  \[ [x_{i+1}, E_{i+1}] = I_{i+1}([x_i, E_i]) \quad \text{for} \quad i = 1, 2, \dots, K \]
- **Cross-Modal Projection Matching Loss (CMPM Loss)**:
  \[ L_{c2t} = \mathrm{KL}(p \,\|\, q) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} p_{i,j} \log\left( \frac{p_{i,j}}{q_{i,j} + \epsilon} \right) \]
  where
  \[ q_{i,j} = \frac{y_{i,j}}{\sum_{k=1}^{N} y_{i,k}} \]
- **Overall Optimization Objective**:
  \[ L_{mtda} = L_{c2t} + L_{t2c} \]

These formulas show the encoding processes of text and image features and the loss functions used to optimize the model.
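The CMPM loss and the overall objective in the formula summary can be illustrated with a small pure-Python sketch. This is not the authors' implementation. The summary does not define \(p_{i,j}\), so the sketch assumes it is a row-wise softmax over query-target similarity scores; the `temperature` parameter and all function names are hypothetical. It also assumes that `c2t` denotes the composed-query-to-target direction and `t2c` the reverse, and that every row and column of the label matrix contains at least one match.

```python
import math


def softmax(row, temperature=1.0):
    """Numerically stable softmax over one row of similarity scores."""
    scaled = [v / temperature for v in row]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_matching_loss(sim, labels, eps=1e-8, temperature=1.0):
    """(1/N) * sum_i sum_j p_ij * log(p_ij / (q_ij + eps)).

    sim[i][j]    -- similarity between query i and target j
    labels[i][j] -- y_ij: 1 if query i matches target j, else 0
    p_ij is ASSUMED to be softmax(sim[i])[j]; q_ij = y_ij / sum_k y_ik.
    """
    n = len(sim)
    loss = 0.0
    for i in range(n):
        p = softmax(sim[i], temperature)
        label_sum = sum(labels[i])  # assumed > 0 for every row
        q = [y / label_sum for y in labels[i]]
        for p_ij, q_ij in zip(p, q):
            if p_ij > 0.0:  # convention: 0 * log(0) contributes nothing
                loss += p_ij * math.log(p_ij / (q_ij + eps))
    return loss / n


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def dual_alignment_loss(sim, labels, eps=1e-8, temperature=1.0):
    """L_mtda = L_c2t + L_t2c, with the reverse direction taken as the transpose."""
    forward = kl_matching_loss(sim, labels, eps, temperature)
    backward = kl_matching_loss(transpose(sim), transpose(labels), eps, temperature)
    return forward + backward


# Toy batch of two query-target pairs; matches lie on the diagonal.
labels = [[1, 0], [0, 1]]
aligned = [[10.0, 0.0], [0.0, 10.0]]      # matched pairs score highest
misaligned = [[0.0, 10.0], [10.0, 0.0]]   # mismatched pairs score highest

loss_aligned = dual_alignment_loss(aligned, labels)
loss_misaligned = dual_alignment_loss(misaligned, labels)
```

In this toy batch the loss is close to zero when matched pairs receive the highest similarity and grows large when the scores favor mismatched pairs, which is the behavior the KL term is meant to enforce.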