Indicative Vision Transformer for end-to-end zero-shot sketch-based image retrieval
Haoxiang Zhang, Deqiang Cheng, Qiqi Kou, Mujtaba Asad, He Jiang
DOI: https://doi.org/10.1016/j.aei.2024.102398
IF: 8.8
2024-02-15
Advanced Engineering Informatics
Abstract: Zero-shot sketch-based image retrieval (ZS-SBIR) has garnered attention for overcoming the inconvenience and impracticality of traditional image retrieval (TIR) in the engineering domain. ZS-SBIR can retrieve never-before-seen images from sketches, addressing the problems of insufficient samples and model retraining. However, existing ZS-SBIR approaches have three remaining limitations. First, CNN-based methods struggle to capture global features effectively. Second, hybrid networks treat the sketch and image modalities separately, ignoring the implied feature consistency. Third, non-end-to-end Vision Transformer (ViT) models incur expensive training costs. To address these problems, we present an end-to-end retrieval approach, which first extends the ViT through indicative information. The core of the algorithm is a feature picker with an indicative multi-layer perceptron. It processes images and sketches jointly at relatively low cost while yielding surprising benefits. To tackle the inherent modal and semantic gaps in ZS-SBIR, we propose a parallel feature adapter. In this adapter, the features are modulated by an identification learning module to generate discriminative information. Next, feature-level smooth alignment is used to enhance the learning of inter-class relationships. In addition, we employ a logit-level auxiliary signal to direct the model to capture additional semantic knowledge. Extensive experiments show that the proposed approach significantly outperforms state-of-the-art retrieval methods on the Sketchy, Sketchy-No, QuickDraw, and TU-Berlin datasets.
engineering, multidisciplinary; computer science, artificial intelligence
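
The abstract does not spell out the retrieval step itself, so the following is only a minimal sketch of how sketch-based retrieval is commonly carried out once an encoder has produced embeddings: gallery images are ranked by cosine similarity to the query sketch embedding. It does not reproduce the authors' feature picker or parallel feature adapter. The function names (`l2_normalize`, `cosine_similarity`, `retrieve`), the dictionary-based gallery format, and the toy three-dimensional embeddings are all hypothetical and chosen for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def retrieve(sketch_embedding, gallery, top_k=3):
    """Rank gallery images by cosine similarity to the query sketch.

    gallery: dict mapping an image id to its embedding
             (a hypothetical toy format, not the paper's data layout).
    Returns the top_k (image_id, score) pairs, best match first.
    """
    scored = [
        (image_id, cosine_similarity(sketch_embedding, emb))
        for image_id, emb in gallery.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings standing in for the outputs of a shared sketch/image encoder.
gallery = {
    "cat_01": [0.9, 0.1, 0.0],
    "dog_07": [0.1, 0.9, 0.1],
    "car_03": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # embedding of a hypothetical query sketch

ranking = retrieve(query, gallery, top_k=2)
for image_id, score in ranking:
    print(f"{image_id}: {score:.3f}")
```

In a zero-shot setting the gallery classes are unseen during training, so the ranking quality depends entirely on how well the encoder aligns sketches and images in the shared embedding space; the ranking step itself stays the same.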