Transformer-based interpretable multi-modal data fusion for skin lesion classification

Theodor Cheslerean-Boghiu, Melia-Evelina Fleischmann, Theresa Willem, Tobias Lasser
2023-08-31
Abstract: A lot of deep learning (DL) research these days is mainly focused on improving quantitative metrics regardless of other factors. In human-centered applications, like skin lesion classification in dermatology, DL-driven clinical decision support systems are still in their infancy due to the limited transparency of their decision-making process. Moreover, the lack of procedures that can explain the behavior of trained DL algorithms leads to almost no trust from clinical physicians. To diagnose skin lesions, dermatologists rely on visual assessment of the disease and the data gathered from the patient's anamnesis. Data-driven algorithms dealing with multi-modal data are limited by the separation of feature-level and decision-level fusion procedures required by convolutional architectures. To address this issue, we enable single-stage multi-modal data fusion via the attention mechanism of transformer-based architectures to aid in diagnosing skin diseases. Our method beats other state-of-the-art single- and multi-modal DL architectures in image-rich and patient-data-rich environments. Additionally, the choice of the architecture enables native interpretability support for the classification task both in the image and metadata domain with no additional modifications necessary.
Image and Video Processing, Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper addresses skin lesion classification by proposing a transformer-based multi-modal data fusion method that aims to improve diagnostic accuracy and make the model's decisions interpretable. Specifically, it targets the following problems:

1. **Improving the accuracy of skin lesion classification**: The transformer architecture fuses image information (the visual features of a skin lesion) with patient history and other metadata, with the goal of achieving better classification results than images or metadata alone.
2. **Enhancing model transparency and interpretability**: Clinicians need to understand how a model reaches its decision before they can rely on it. The study therefore focuses on making the decision-making process transparent enough for dermatologists to trust and adopt deep-learning-based decision support systems.
3. **Addressing the limitations of existing deep learning models**: Many current models, especially those based on convolutional neural networks, handle multi-modal data through separate feature-level and decision-level fusion stages, which restricts model performance. The proposed method replaces this separation with single-stage multi-modal fusion through the attention mechanism.
4. **Evaluating the impact of different metadata combinations**: The paper explores how the quantity and type of metadata affect model performance and reports that careful metadata selection and engineering can further improve classification.

In summary, the paper aims to improve diagnostic accuracy in skin lesion classification through multi-modal data fusion, to increase clinicians' trust in such systems by making the model interpretable, and to optimize performance by selecting appropriate metadata.
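
To make the idea of single-stage fusion concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes image patches and metadata fields have already been embedded as tokens of equal width, and that a class token is prepended (a common transformer convention, assumed here). The token values, the metadata field names, and all weights are random placeholders. Because all tokens pass through the same self-attention step, the attention row of the class token covers image and metadata positions at once, which is one common way such weights are read as an explanation.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def random_matrix(rows, cols, rng):
    """Placeholder projection weights; a trained model would learn these."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def self_attention(tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over one token sequence."""
    d = len(Wq)
    Q = [matvec(Wq, t) for t in tokens]
    K = [matvec(Wk, t) for t in tokens]
    V = [matvec(Wv, t) for t in tokens]
    weights, outputs = [], []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * V[j][c] for j in range(len(V))) for c in range(d)])
    return outputs, weights


rng = random.Random(0)
d_model = 8                                  # token width (assumed)
n_patches = 4                                # number of image patch tokens (assumed)
meta_fields = ["age", "sex", "body_site"]    # hypothetical metadata fields


def random_token():
    return [rng.uniform(-1, 1) for _ in range(d_model)]


# Stand-ins for learned embeddings of each modality.
cls_token = [random_token()]
image_tokens = [random_token() for _ in range(n_patches)]
meta_tokens = [random_token() for _ in meta_fields]

# Single-stage fusion: one sequence, one attention step over both modalities.
tokens = cls_token + image_tokens + meta_tokens
Wq, Wk, Wv = (random_matrix(d_model, d_model, rng) for _ in range(3))
fused, attn = self_attention(tokens, Wq, Wk, Wv)

# The class token's attention row spans image patches and metadata fields.
cls_attn = attn[0]
image_attention = cls_attn[1:1 + n_patches]
meta_attention = dict(zip(meta_fields, cls_attn[1 + n_patches:]))

print("attention on image patches:", [round(w, 3) for w in image_attention])
print("attention on metadata:", {k: round(v, 3) for k, v in meta_attention.items()})
```

In a convolutional pipeline the metadata would typically be merged with image features in a separate step after the image backbone. Here the two modalities interact inside the same attention operation, so no separate fusion stage is needed.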