Deep Supervision Network with Contrastive Learning for Zero-Shot Sketch-Based Image Retrieval

Zhenqiu Shu, Guangyao Zhuo, Jun Yu, Zhengtao Yu
DOI: https://doi.org/10.1016/j.asoc.2024.112474
IF: 8.7
2024-01-01
Applied Soft Computing
Abstract: Zero-shot sketch-based image retrieval (ZS-SBIR) is an extremely challenging cross-modal retrieval task. In ZS-SBIR, hand-drawn sketches are used as queries to retrieve corresponding natural images in zero-shot scenarios. Existing methods utilize diverse loss functions to guide deep neural networks (DNNs) to align the feature representations of sketches and images. In general, these methods supervise only the last layer of the DNN and then update each layer through back-propagation. However, this strategy cannot effectively optimize the intermediate layers of DNNs, potentially hindering retrieval performance. To address this issue, we propose a deep supervision network with contrastive learning (DSNCL) approach for ZS-SBIR. Specifically, we employ a novel deep supervision training method that attaches multiple projection heads to the intermediate layers of DNNs. These projection heads map multi-level features to a normalized embedding space and are trained by contrastive learning. The proposed method instructs the intermediate layers of DNNs to learn invariance to various data augmentations, thereby aligning the feature representations of sketches and images. This significantly narrows both the domain gap and the semantic gap between sketches and images. In addition, applying contrastive learning directly to the intermediate layers makes them easier to optimize. Furthermore, we investigate the cross-batch metric (CBM) learning mechanism, which constructs a semantic queue that stores samples from different batches for metric learning, to further improve performance in ZS-SBIR applications. Comprehensive experimental results on the Sketchy and TU-Berlin datasets validate the superiority of our DSNCL method over existing state-of-the-art methods.
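The abstract does not give the loss or the queue design, so the following is only a rough sketch of the general idea, not the authors' implementation. It assumes an InfoNCE-style contrastive loss, assumes the projection heads have already produced one embedding per sample at each supervised layer, and uses a fixed-size FIFO queue of earlier embeddings as extra negatives. The names `info_nce`, `deep_supervision_loss`, and the `temperature` and `weights` parameters are hypothetical, and treating every queued sample as a negative is a simplification that ignores class labels.

```python
import math
from collections import deque


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, positives, extra_negatives=(), temperature=0.1):
    """Mean InfoNCE loss over a batch (assumed loss, not taken from the paper).

    anchors[i] should match positives[i]; every other positive and every
    vector in extra_negatives is treated as a negative for anchors[i].
    """
    anchors = [l2_normalize(a) for a in anchors]
    candidates = [l2_normalize(c) for c in list(positives) + list(extra_negatives)]
    total = 0.0
    for i, a in enumerate(anchors):
        # Cosine similarity to every candidate, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(a, c)) / temperature for c in candidates]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


def deep_supervision_loss(sketch_levels, image_levels, weights=None):
    """Weighted sum of per-level contrastive losses.

    sketch_levels[k] and image_levels[k] hold the batch of embeddings that a
    (hypothetical) projection head produced from the k-th supervised layer.
    """
    weights = weights or [1.0] * len(sketch_levels)
    return sum(
        w * info_nce(s, im)
        for w, s, im in zip(weights, sketch_levels, image_levels)
    )


# Toy batch: two sketches and two images, embedded at two network depths.
sketch_levels = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.1], [0.1, 2.0]]]
image_levels = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.2], [0.2, 1.0]]]
multi_level = deep_supervision_loss(sketch_levels, image_levels)

# Cross-batch sketch: a bounded queue keeps embeddings from earlier batches
# and supplies them as additional negatives for the current batch.
queue = deque(maxlen=2)
queue.extend([[1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])  # oldest entry is evicted
with_queue = info_nce(sketch_levels[0], image_levels[0], extra_negatives=queue)
```

Each supervised layer contributes its own loss term, so gradients reach the intermediate layers directly instead of only through the final layer. The queue enlarges the pool of comparison samples beyond the current mini-batch without enlarging the batch itself.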