Pay Less On Clinical Images: Asymmetric Multi-Modal Fusion Method For Efficient Multi-Label Skin Lesion Classification

Peng Tang, Tobias Lasser
2024-07-14
Abstract: Existing multi-modal approaches primarily focus on enhancing multi-label skin lesion classification performance through advanced fusion modules, often neglecting the associated rise in parameters. In clinical settings, both clinical and dermoscopy images are captured for diagnosis; however, dermoscopy images exhibit more crucial visual features for multi-label skin lesion classification. Motivated by this observation, we introduce a novel asymmetric multi-modal fusion method for efficient multi-label skin lesion classification. Our fusion method incorporates two innovative schemes. Firstly, we validate the effectiveness of our asymmetric fusion structure. It employs a light and simple network for clinical images and a heavier, more complex one for dermoscopy images, resulting in significant parameter savings compared to the symmetric fusion structure, which uses two identical networks for both modalities. Secondly, in contrast to previous approaches that use mutual attention modules for interaction between image modalities, we propose an asymmetric attention module. This module leverages clinical image information solely to enhance dermoscopy image features, treating clinical images as supplementary information in our pipeline. We conduct extensive experiments on the seven-point checklist dataset. Results demonstrate the generality of our proposed method for both convolutional networks and Transformer structures, showcasing its superiority over existing methods. We will make our code publicly available.
Image and Video Processing, Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### The Problem Addressed by the Paper

The paper primarily addresses the trade-off between the number of parameters and accuracy in Multi-Modal Skin Lesion Classification (MM-SLC). Specifically:

1. **Reducing the Number of Parameters**: Existing multi-modal methods typically focus on improving the performance of multi-label skin lesion classification through advanced fusion modules, but often overlook the resulting increase in the number of parameters. This paper proposes a novel Asymmetric Multi-Modal Fusion Method (AMMFM) that aims to significantly reduce model parameters while only slightly affecting accuracy.
2. **Differences Between Clinical Images and Dermoscopy Images**: In clinical settings, both Clinical Images (CI) and Dermoscopy Images (DI) are usually collected. However, dermoscopy images are more critical for multi-label skin lesion classification. Therefore, the proposed method employs different network structures for these two types of images:
   - For clinical images, a lightweight network (e.g., MobileNetV3) is used.
   - For dermoscopy images, a heavier and more complex network (e.g., ResNet, ConvNeXt, or Swin Transformer) is used.
3. **Asymmetric Attention Mechanism**: Unlike the Bidirectional Attention Block (BAB) used in previous work, this paper proposes an Asymmetric Attention Block (AAB) that uses clinical image information only to enhance dermoscopy image features. This helps avoid the overfitting caused by overemphasizing supplementary information from clinical images.

### Main Contributions

1. A new Asymmetric Fusion Framework (AFF) is proposed, which uses different network structures to extract information from different modalities, significantly reducing model parameters while maintaining or only slightly lowering classification accuracy.
2. A new Asymmetric Attention Block (AAB) is introduced, which specifically uses clinical image features to enhance dermoscopy image features, thereby improving classification performance and reducing the number of parameters.
3. The proposed AMMFM method achieves state-of-the-art performance on the seven-point checklist dataset and demonstrates its generality across both convolutional and Transformer backbones.
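
To make the one-directional idea behind the asymmetric attention concrete, here is a minimal sketch in plain Python. It is not the authors' AAB: it is ordinary scaled dot-product cross-attention with no learned projections, and it assumes both modalities have already been reduced to token vectors of the same dimension. The function name `one_way_attention` and the toy values are hypothetical. The point it illustrates is only the direction of information flow: dermoscopy tokens act as queries, clinical tokens act as keys and values, and nothing flows back to the clinical branch.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def one_way_attention(derm_tokens, clin_tokens):
    """Enhance dermoscopy tokens using clinical tokens, in one direction only.

    Assumptions (for illustration, not taken from the paper):
      * both inputs are lists of vectors with the same dimension;
      * queries are the raw dermoscopy tokens, keys and values are the
        raw clinical tokens (no learned projection matrices);
      * the attended clinical context is added residually.
    """
    dim = len(derm_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    enhanced = []
    for query in derm_tokens:
        # Attention weights of this dermoscopy token over all clinical tokens.
        weights = softmax([dot(query, key) * scale for key in clin_tokens])
        # Weighted sum of clinical tokens (the supplementary context).
        context = [
            sum(w * value[i] for w, value in zip(weights, clin_tokens))
            for i in range(dim)
        ]
        # Residual update: only the dermoscopy features are modified.
        enhanced.append([q + c for q, c in zip(query, context)])
    return enhanced


# Toy usage with hypothetical 2-dimensional tokens.
derm = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
clin = [[0.5, 0.5], [-1.0, 2.0]]
fused = one_way_attention(derm, clin)
print(len(fused), len(fused[0]))
```

A bidirectional block would run the same computation a second time with the roles swapped, updating the clinical tokens from the dermoscopy tokens as well. Dropping that second pass is what makes the attention asymmetric in this sketch.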