Dr. CLIP: CLIP-Driven Universal Framework for Zero-Shot Sketch Image Retrieval

Xue Li, Jiong Yu, Ziyang Li, Hongchun Lu, Ruifeng Yuan
DOI: https://doi.org/10.1145/3664647.3680702
2024-01-01
Abstract: The field of Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) is undergoing a paradigm shift, transitioning from specialized models designed for individual tasks to more general retrieval models capable of managing various specialized scenarios. Inspired by the impressive generalization ability of the Contrastive Language-Image Pretraining (CLIP) model, we propose a CLIP-driven universal framework (Dr. CLIP), which leverages prompt learning to guide the synergy between CLIP and ZS-SBIR. Dr. CLIP can cover all four variants of the ZS-SBIR task (inter-category, intra-category, cross-dataset, and generalization). Moreover, we decompose the synergy into classification learning, metric learning, and ranking learning, and introduce three key components to enhance learning effectiveness: i) a forgetting suppression idea is applied to prevent catastrophic forgetting and constrain the feature distribution of the new categories in classification learning; ii) a domain-balanced loss is proposed to address sample imbalance and establish effective cross-domain correlations in metric learning; iii) a pair-relation strategy is introduced to capture relevance and ranking relationships between instances in ranking learning. Finally, we reorganize and redivide three coarse-grained datasets and two fine-grained datasets to accommodate the training settings for the four ZS-SBIR tasks. Comparison experiments confirm that our method surpasses state-of-the-art (SOTA) methods by a significant margin (1.95% to 19.14% mAP), highlighting its generality and superiority. The code is available at https://github.com/x-28/CDUF.git.