Multi-modal Contrastive-Generative Pre-training for Fine-grained Skin Disease Diagnosis.

Liangdi Ma, Jun Zhao, Guoxin Wang, Yuchen Guo, Feng Xu
DOI: https://doi.org/10.1109/BIBM58861.2023.10385898
2023-01-01
Abstract: Vision-language pre-training (VLP) leverages easily accessible image-text pairs instead of high-cost expert-annotated labels for pre-training, and it has achieved promising performance and attracted considerable attention. There are many works on coarse-grained VLP in natural-image and medical radiology applications. However, the fine-grained setting, although common in practice, is still unexplored, especially in medical applications like skin disease diagnosis. In fine-grained cases, the visual appearance of different objects is highly similar, and the language information may be sparse and noisy; both factors make it considerably harder to learn effective features with VLP. In this paper, we address these difficulties and propose a novel Multilevel Multi-modal Contrastive-Generative (M2CG) pre-training method. M2CG has 1) a feature-level multi-modal contrastive module to learn fine-grained features via semantic knowledge, and 2) a semantic-level cross-modal generation module to force the model to capture key and discriminative features. We construct a multi-modal skin disease dataset, containing user-taken lesion photos, chief complaints, and consultation dialogues, to perform VLP with M2CG, and we evaluate the performance on three public skin disease benchmarks and a fine-grained dataset with 64 categories collected from real-world applications. M2CG outperforms the state-of-the-art VLP methods by up to 11.11% in diagnosis accuracy, yielding consistent and significant improvements and facilitating skin disease diagnosis. To our knowledge, this is the first VLP study presented for fine-grained skin disease diagnosis. We believe that the success of M2CG will inspire more innovations in fine-grained VLP for medical practice.
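The abstract does not specify the loss used by the feature-level contrastive module. As a rough illustration of what image-text contrastive pre-training generally involves, the sketch below implements a symmetric InfoNCE-style loss of the kind commonly used in VLP. It is not M2CG itself: the function name symmetric_contrastive_loss, the helper names, and the temperature value 0.07 are assumptions made for this example, and the cross-modal generation module is not covered.

```python
import math

def _normalize(vec):
    # Scale a feature vector to unit length (assumes a non-zero vector).
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def _diagonal_cross_entropy(logits):
    # Mean cross-entropy where the correct class of row i is column i,
    # i.e. the i-th image should match the i-th text.
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(image_feats, text_feats, temperature=0.07):
    # Hypothetical helper: image_feats[i] and text_feats[i] form a matched pair.
    # temperature=0.07 is an assumed default, not a value from the paper.
    imgs = [_normalize(v) for v in image_feats]
    txts = [_normalize(v) for v in text_feats]
    # Cosine similarity between every image and every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))

if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    # Matched pairs should give a much lower loss than mismatched pairs.
    print(symmetric_contrastive_loss(images, aligned_texts))
    print(symmetric_contrastive_loss(images, swapped_texts))
```

In the fine-grained setting the abstract describes, many off-diagonal pairs would have similarities close to the diagonal ones, which is one way to see why a plain contrastive objective like this becomes harder to optimize there.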