Joint-individual fusion structure with fusion attention module for multi-modal skin cancer classification

Peng Tang, Xintong Yan, Yang Nan, Xiaobin Hu, Bjoern H. Menze, Sebastian Krammer, Tobias Lasser
DOI: https://doi.org/10.1016/j.patcog.2024.110604
IF: 8
2024-05-20
Pattern Recognition
Abstract: Many convolutional neural network (CNN) based approaches for skin cancer classification primarily rely on dermatological images, yielding commendable results in classification accuracy. However, leveraging patient metadata, a crucial source of clinical information for dermatologists, can further enhance accuracy. Current methodologies predominantly employ basic joint fusion structures (FS) and fusion modules (FMs) for multi-modal classification, leaving room for advancement in enhancing accuracy through exploration of more sophisticated FS and FM architectures. Thus, this paper introduces a novel fusion method that integrates dermatological images (dermoscopy images or clinical images) with patient metadata for skin cancer classification, focusing on enhancing FS and FM components. Initially, we propose a joint-individual fusion (JIF) structure that simultaneously learns shared features across multi-modality data while preserving specific characteristics. Subsequently, we introduce a multi-modal fusion attention (MMFA) module designed to amplify the most relevant image and metadata features through a combination of self and mutual attention mechanisms, thereby bolstering the decision-making pipeline. Our study compares the efficacy of the proposed JIF-MMFA method with other state-of-the-art fusion techniques across three distinct public datasets. Results demonstrate that the JIF-MMFA method consistently enhances classification outcomes across various CNN backbones, outperforming alternative fusion methodologies on all three datasets. These findings underscore the effectiveness and robustness of our proposed approach in skin cancer classification.
Categories: Computer Science, Artificial Intelligence; Engineering, Electrical & Electronic
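The abstract describes two ideas only at a high level: a fusion module that reweights image and metadata features using both self and mutual attention, and a joint-individual structure that keeps modality-specific branches alongside a fused branch. The following is a minimal, dependency-free sketch of how such a design could look; it is not the authors' implementation. The sigmoid gating, the summation of the self and mutual gates, the averaging of the three branch logits, the random untrained weights, and all names (`FusionAttentionSketch`, `jif_predict`) are assumptions made purely for illustration.

```python
import math
import random


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def linear(W, x):
    """Matrix-vector product; W is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters (assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


class FusionAttentionSketch:
    """Hypothetical gate-style fusion: each modality is reweighted by a gate
    computed from itself (self attention) and a gate computed from the other
    modality (mutual attention)."""

    def __init__(self, d_img, d_meta, rng):
        self.W_self_img = rand_matrix(d_img, d_img, rng)
        self.W_self_meta = rand_matrix(d_meta, d_meta, rng)
        self.W_meta_to_img = rand_matrix(d_img, d_meta, rng)
        self.W_img_to_meta = rand_matrix(d_meta, d_img, rng)

    def __call__(self, x_img, x_meta):
        # Self attention: gates derived from the modality's own features.
        g_img_self = sigmoid(linear(self.W_self_img, x_img))
        g_meta_self = sigmoid(linear(self.W_self_meta, x_meta))
        # Mutual attention: gates derived from the other modality.
        g_img_mut = sigmoid(linear(self.W_meta_to_img, x_meta))
        g_meta_mut = sigmoid(linear(self.W_img_to_meta, x_img))
        # Combining the gates by summation is an illustrative choice.
        img = [x * (a + b) for x, a, b in zip(x_img, g_img_self, g_img_mut)]
        meta = [x * (a + b) for x, a, b in zip(x_meta, g_meta_self, g_meta_mut)]
        return img, meta


def jif_predict(x_img, x_meta, n_classes, rng):
    """Joint-individual sketch: one fused head plus one head per modality;
    the class logits of the three heads are averaged (assumption)."""
    fuse = FusionAttentionSketch(len(x_img), len(x_meta), rng)
    img_att, meta_att = fuse(x_img, x_meta)
    joint = img_att + meta_att  # list concatenation of attended features

    W_joint = rand_matrix(n_classes, len(joint), rng)
    W_img = rand_matrix(n_classes, len(x_img), rng)
    W_meta = rand_matrix(n_classes, len(x_meta), rng)

    branch_logits = [
        linear(W_joint, joint),  # shared (joint) branch
        linear(W_img, x_img),    # image-only branch
        linear(W_meta, x_meta),  # metadata-only branch
    ]
    avg = [sum(col) / len(branch_logits) for col in zip(*branch_logits)]
    return softmax(avg)


if __name__ == "__main__":
    rng = random.Random(0)
    image_features = [0.2, -1.0, 0.5, 1.5]  # stand-in for CNN backbone output
    metadata_features = [1.0, 0.0, 0.3]     # stand-in for encoded patient metadata
    print(jif_predict(image_features, metadata_features, 3, rng))
```

In a real system the image features would come from a CNN backbone, the metadata features from an encoder over patient attributes, and all weights would be trained jointly; the sketch only shows how the individual branches can coexist with the fused branch.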