Abstract: Composed image retrieval (CIR) is the task of searching for target images using an image-text pair as a query. Given the straightforward relation between the query pair and the target image, the dominant methods follow the learning paradigm of common image-text retrieval and simply model the task as a query-target matching problem. In particular, the common practice first encodes the multi-modal query into a single feature and then aligns it with the target image. However, such a learning paradigm explores only the naive relation in the triplets. We argue that CIR triplets encompass additional associations besides the primary query-target relation, which are overlooked in existing works. In this paper, we disclose two new relations residing in the triplets by viewing the triplet as a graph node. By analogy with the graph node, we mine two associations: text-bridged image alignment and complementary text reasoning. Text-bridged image alignment treats composed image retrieval as a specialized form of image retrieval, where the query text acts as a bridge between the query image and the target image; a hinge-based cross attention is proposed to incorporate this relation into network learning. Complementary text reasoning, on the other hand, regards composed image retrieval as a specific type of cross-modal retrieval, where the two images are composed and used to reason about the complementary text. To integrate these views effectively, a twin attention-based compositor is designed. By combining these two complementary associations with the explicit query pair-target image relation, we establish a comprehensive set of constraints for composed image retrieval. With the above designs, we develop CaLa, a Complementary Association Learning framework for Augmenting Composed Image Retrieval.
Experimental evaluations are conducted on the widely used CIRR and FashionIQ benchmarks with multiple backbones to validate the effectiveness of CaLa. The results demonstrate the superiority of our method on the composed image retrieval task. Our code and models are available at https://github.com/Chiangsonw/CaLa

CaLa: Complementary Association Learning for Augmenting Composed Image Retrieval
