Abstract: This paper introduces a novel approach to enhance content-based image retrieval, validated on two benchmark datasets: ISIC-2017 and ISIC-2018. These datasets comprise skin lesion images that are crucial for advances in skin cancer diagnosis and treatment. We advocate the use of a pre-trained Vision Transformer (ViT), a relatively unexplored choice for image retrieval, particularly in medical scenarios. In contrast to the traditionally employed Convolutional Neural Networks (CNNs), our findings suggest that ViT captures image context more comprehensively, which is essential in medical imaging. We further incorporate a weighted multi-loss function, examining triplet loss, distillation loss, contrastive loss, and cross-entropy loss. We investigate which combination of these losses is most robust, with the aim of strengthening the learned feature space and improving precision and recall in the retrieval process. Rather than using all four losses, the proposed multi-loss function combines only cross-entropy loss, triplet loss, and distillation loss, and improves mean average precision by 6.52% and 3.45% on ISIC-2017 and ISIC-2018, respectively. Another innovation in our methodology is a two-branch network strategy, which improves image retrieval and classification concurrently. Through our experiments, we highlight both the effectiveness and the pitfalls of different loss configurations in image retrieval. Furthermore, our approach shows the advantage of retrieval-based classification through majority voting over relying solely on the classification head, leading to better prediction of melanoma, the most lethal type of skin cancer.
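As a rough illustration of retrieval-based classification through majority voting, the following minimal sketch retrieves the most similar gallery embeddings for a query and assigns the most frequent label among them. The use of cosine similarity, the value of `k`, the function names, and the toy embeddings are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np
from collections import Counter


def retrieve_top_k(query, gallery, k=5):
    """Return indices of the k gallery embeddings closest to the query.

    Cosine similarity is an assumption here; the paper's distance
    measure is not stated in the abstract.
    """
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q
    # Sort by descending similarity and keep the first k indices.
    return np.argsort(-sims)[:k]


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


# Toy 2-D embeddings standing in for ViT features (hypothetical data).
gallery = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
    [0.8, 0.2],
])
gallery_labels = ["melanoma", "melanoma", "nevus", "nevus", "melanoma"]

query = np.array([1.0, 0.05])
top = retrieve_top_k(query, gallery, k=3)
predicted = majority_vote([gallery_labels[i] for i in top])
print(top.tolist(), predicted)
```

In this sketch the predicted class depends only on the retrieved neighbours, so it can be compared directly against the output of a separate classification head.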
Our results surpass existing state-of-the-art techniques on the ISIC-2017 and ISIC-2018 datasets, improving mean average precision by 1.01% and 4.36%, respectively. This underscores the efficacy and promise of Vision Transformers paired with our tailor-made weighted loss function, especially in medical contexts. The effectiveness of the proposed approach is substantiated through thorough ablation studies and a range of quantitative and qualitative results. To promote reproducibility and support future research, our source code will be made available on GitHub.
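The weighted combination of cross-entropy, triplet, and distillation losses described above could be sketched as follows. This is a minimal NumPy illustration under several assumptions: the weights, the margin, the temperature, and the teacher-student form of the distillation term are all hypothetical, since the abstract does not specify them.

```python
import numpy as np


def softmax(z, T=1.0):
    """Row-wise softmax with temperature T."""
    z = z / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the true class."""
    p = softmax(logits)
    return float(-np.log(p[np.arange(len(labels)), labels]).mean())


def triplet(anchor, positive, negative, margin=1.0):
    """Mean hinge on Euclidean distances: max(d_ap - d_an + margin, 0)."""
    d_ap = np.linalg.norm(anchor - positive, axis=1)
    d_an = np.linalg.norm(anchor - negative, axis=1)
    return float(np.maximum(d_ap - d_an + margin, 0.0).mean())


def distillation(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened outputs (assumed form)."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = (p_t * (np.log(p_t) - np.log(p_s))).sum(axis=1)
    return float(kl.mean() * T * T)


def multi_loss(ce, tri, kd, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; the weights are hypothetical."""
    w_ce, w_tri, w_kd = weights
    return w_ce * ce + w_tri * tri + w_kd * kd


# Toy batch of two samples with two classes (hypothetical values).
logits = np.array([[2.0, 0.0], [0.0, 2.0]])
labels = np.array([0, 1])
teacher = np.array([[3.0, 0.0], [0.0, 3.0]])
anchor = np.array([[0.0, 0.0], [1.0, 1.0]])
positive = np.array([[0.1, 0.0], [1.0, 0.9]])
negative = np.array([[1.0, 1.0], [0.0, 0.0]])

total = multi_loss(
    cross_entropy(logits, labels),
    triplet(anchor, positive, negative),
    distillation(logits, teacher),
    weights=(1.0, 0.5, 0.5),
)
print(total)
```

Keeping each term as a separate function makes it straightforward to drop one (for example, a contrastive term) or to vary the weights when comparing loss configurations.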

OCFormer: One-Class Transformer Network for Image Classification

Transformer in Optronic Neural Networks for Image Classification

Less complexity one-class classification approach using construction error of convolutional image transformation network

OneCAD: One Classifier for All image Datasets using multimodal learning

Vision Transformers for Remote Sensing Image Classification

Analyzing Vision Transformers for Image Classification in Class Embedding Space

Do Vision Transformers See Like Convolutional Neural Networks?

Vision Conformer: Incorporating Convolutions into Vision Transformer Layers

CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

OVO: One-shot Vision Transformer Search with Online distillation

A Vision Transformer Architecture for Open Set Recognition

Asymmetric Vision Transformers for Multi-Label Classification

ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Vicinity Vision Transformer

Multi-Attribute Vision Transformers are Efficient and Robust Learners

MedViT: A robust vision transformer for generalized medical image classification

Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition

Ensemble of vision transformer architectures for efficient Alzheimer's Disease classification

Echoes of images: multi-loss network for image retrieval in vision transformers