Abstract: We address the challenges inherent in sketch-based image retrieval (SBIR) across various settings, including zero-shot SBIR, generalized zero-shot SBIR, and fine-grained zero-shot SBIR, by leveraging the vision-language foundation model CLIP. While recent work has employed CLIP to enhance SBIR, these approaches predominantly follow uni-modal prompt processing and fail to fully exploit CLIP's integrated visual and textual capabilities. To bridge this gap, we introduce SpLIP, a novel multi-modal prompt learning scheme designed to operate effectively with frozen CLIP backbones. We diverge from existing multi-modal prompting methods that treat visual and textual prompts independently or integrate them in a limited fashion, leading to suboptimal generalization. SpLIP implements a bi-directional prompt-sharing strategy that enables mutual knowledge exchange between CLIP's visual and textual encoders, fostering a more cohesive and synergistic prompt processing mechanism that significantly reduces the semantic gap between the sketch and photo embeddings. In addition to pioneering multi-modal prompt learning, we propose two innovative strategies for further refining the embedding space. The first is an adaptive margin generation for the sketch-photo triplet loss, regulated by CLIP's class textual embeddings. The second introduces a novel task, termed conditional cross-modal jigsaw, aimed at enhancing fine-grained sketch-photo alignment by implicitly modeling sketches' viable patch arrangement using knowledge of unshuffled photos. Our comprehensive experimental evaluations across multiple benchmarks demonstrate the superior performance of SpLIP in all three SBIR scenarios. Project page: https://mainaksingha01.github.io/SpLIP/

What problem does this paper attempt to address?

The paper asks how to improve sketch-based image retrieval (SBIR) across three settings: zero-shot SBIR (ZS-SBIR), generalized zero-shot SBIR (GZS-SBIR), and fine-grained zero-shot SBIR (FG-ZS-SBIR). The authors argue that existing CLIP-based methods rely mainly on unimodal prompt processing and do not fully exploit CLIP's integrated visual and textual capabilities, which limits their performance. They therefore propose SpLIP, a new multimodal prompt learning scheme.

### Specific description of the problem

1. **Cross-domain differences**: A sketch and a photo of the same category come from different domains, so their appearance differs significantly.
2. **Zero-shot learning challenges**: The test categories are never seen during training, which raises the demands on the model's ability to generalize.
3. **Fine-grained matching**: In FG-ZS-SBIR, a sketch must be matched to the specific photo instance it depicts, not just to the correct category.

### Shortcomings of existing methods

- **Unimodal prompts**: Most existing methods use prompts in a single modality (only visual or only textual), ignoring the complementarity of visual and textual information in CLIP.
- **Static text elements**: Some methods do not fully exploit the flexibility of the text pathway, which leaves them insensitive to visual nuances.
- **Local and global context**: Existing patch-shuffling strategies may lead to overfitting and do not effectively bridge the gap between local and global context.

### SpLIP's solutions

To overcome these problems, SpLIP introduces the following innovations:

1. **Bidirectional prompt sharing**: Prompts are exchanged in both directions between CLIP's text and visual encoders, which strengthens the synergy between the two prompt streams and reduces the semantic gap between sketch and photo embeddings.
2. **Conditional cross-modal jigsaw task**: A new task, conditional cross-modal jigsaw, enhances fine-grained sketch-photo alignment by implicitly modeling the viable patch arrangement of a sketch using knowledge of unshuffled photos.
3. **Adaptive margin generation**: The margin of the sketch-photo triplet loss is generated adaptively, regulated by CLIP's class text embeddings, so that it better reflects the semantic distance between categories.

According to the authors, these improvements let SpLIP outperform existing methods on multiple benchmark datasets in all three SBIR scenarios.
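The adaptive-margin idea can be illustrated with a small sketch. This is not the authors' formulation: the rule `margin = scale * (1 - cosine similarity)` between the class text embeddings, the `scale` parameter, and all function names are assumptions made for illustration, and plain Python lists stand in for CLIP embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def adaptive_margin(text_emb_pos, text_emb_neg, scale=1.0):
    """Hypothetical margin rule (an assumption, not taken from the paper):
    the margin grows as the class text embeddings of the positive and
    negative classes become less similar."""
    return scale * (1.0 - cosine_similarity(text_emb_pos, text_emb_neg))


def triplet_loss(sketch, photo_pos, photo_neg, margin):
    """Standard hinge-style triplet loss with an externally supplied margin.

    The anchor is the sketch embedding; the positive and negative are
    photo embeddings of the same and of a different class.
    """
    d_pos = euclidean_distance(sketch, photo_pos)
    d_neg = euclidean_distance(sketch, photo_neg)
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    # Toy 2-D vectors standing in for text and image embeddings.
    margin = adaptive_margin([1.0, 0.0], [0.0, 1.0])
    loss = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0], margin)
    print(margin, loss)
```

Under this assumed rule, semantically close classes receive a small margin and distant classes a large one, so the loss asks for more separation only where the text embeddings indicate that the categories really differ.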

Elevating All Zero-Shot Sketch-Based Image Retrieval Through Multimodal Prompt Learning
