Abstract: This paper introduces a novel approach to content-based image retrieval, validated on two benchmark datasets: ISIC-2017 and ISIC-2018. These datasets comprise skin lesion images that are important for advances in skin cancer diagnosis and treatment. We advocate the use of a pre-trained Vision Transformer (ViT), which remains relatively unexplored for image retrieval, particularly in medical scenarios. In contrast to the traditionally employed Convolutional Neural Networks (CNNs), our findings suggest that ViT captures the image context more comprehensively, which is essential in medical imaging. We further incorporate a weighted multi-loss function and examine several candidate losses: triplet loss, distillation loss, contrastive loss, and cross-entropy loss. We investigate which combination of these losses yields the most robust multi-loss function, thereby strengthening the learned feature space and improving precision and recall in retrieval. Rather than using all of the losses, the proposed multi-loss function combines only cross-entropy loss, triplet loss, and distillation loss, and improves mean average precision by 6.52% and 3.45% on ISIC-2017 and ISIC-2018, respectively. Another contribution of our methodology is a two-branch network strategy that jointly improves image retrieval and classification. Through our experiments, we highlight both the effectiveness and the pitfalls of different loss configurations in image retrieval. Furthermore, our approach demonstrates the advantage of retrieval-based classification through majority voting over relying solely on the classification head, leading to improved prediction of melanoma, the most lethal type of skin cancer.
Our results surpass existing state-of-the-art techniques on the ISIC-2017 and ISIC-2018 datasets, improving mean average precision by 1.01% and 4.36%, respectively. This underscores the efficacy and promise of Vision Transformers paired with our tailored weighted loss function, especially in medical contexts. The effectiveness of the proposed approach is substantiated through thorough ablation studies and a range of quantitative and qualitative results. To promote reproducibility and support future research, our source code will be made available on GitHub.
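
The weighted multi-loss and the retrieval-based majority vote described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the loss weights, the triplet margin, the distillation temperature, the distance metric, or the number of retrieved neighbours, so every default below is an assumption, as are all function names. The distillation term is assumed to be a temperature-softened KL divergence between teacher and student outputs.

```python
import math
from collections import Counter


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample (log-sum-exp for stability)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def triplet(anchor, positive, negative, margin=0.2):
    """Triplet loss: max(0, d(a, p) - d(a, n) + margin).

    Euclidean distance and margin=0.2 are assumptions.
    """
    gap = math.dist(anchor, positive) - math.dist(anchor, negative)
    return max(0.0, gap + margin)


def distillation(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened softmax outputs.

    The KL form and temperature=2.0 are assumptions.
    """
    def softmax(logits):
        m = max(logits)
        exps = [math.exp((z - m) / temperature) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]

    p, q = softmax(teacher_logits), softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def multi_loss(ce, tri, kd, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three losses; the weights are placeholders."""
    w_ce, w_tri, w_kd = weights
    return w_ce * ce + w_tri * tri + w_kd * kd


def majority_vote(query, gallery, labels, k=3):
    """Label a query by the most common label among its k nearest
    gallery embeddings (Euclidean distance and k=3 are assumptions)."""
    ranked = sorted(range(len(gallery)),
                    key=lambda i: math.dist(query, gallery[i]))
    return Counter(labels[i] for i in ranked[:k]).most_common(1)[0][0]
```

In a two-branch setup of the kind the abstract mentions, the cross-entropy term would typically come from the classification branch and the triplet term from the embedding branch, while `majority_vote` would operate on the learned embeddings at inference time.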

End-to-End Large-Scale Image Retrieval Network with Convolution and Vision Transformers

Investigating the Vision Transformer Model for Image Retrieval Tasks

SemiCVT: Semi-Supervised Convolutional Vision Transformer for Semantic Segmentation

CMT: Convolutional Neural Networks Meet Vision Transformers

Vision Transformer with Convolutions Architecture Search

CTA-Net: A CNN-Transformer Aggregation Network for Improving Multi-Scale Feature Extraction

Echoes of images: multi-loss network for image retrieval in vision transformers

Convolutional Embedding Makes Hierarchical Vision Transformer Stronger

EViTIB: Efficient Vision Transformer Via Inductive Bias Exploration for Image Super-Resolution

DctViT: Discrete Cosine Transform Meet Vision Transformers

A Comprehensive Study of Vision Transformers on Dense Prediction Tasks

Depth-Wise Convolutions in Vision Transformers for Efficient Training on Small Datasets

CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

Locality Guidance for Improving Vision Transformers on Tiny Datasets

A Transformer and Visual Foundation Model-Based Method for Cross-View Remote Sensing Image Retrieval

VisionTwinNet: Gated Clarity Enhancement Paired With Light-Robust CD Transformers

A transformer-CNN parallel network for image guided depth completion

Conformer: Local Features Coupling Global Representations for Visual Recognition

A ConvNet for the 2020s

Demystify Transformers & Convolutions in Modern Image Deep Networks

Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios