MedViLaM: A multimodal large language model with advanced generalizability and explainability for medical data understanding and generation

Lijian Xu, Hao Sun, Ziyu Ni, Hongsheng Li, Shaoting Zhang
2024-09-29
Abstract: Medicine is inherently multimodal and multitask, with diverse data modalities spanning text and imaging. However, most models in the medical field are unimodal and single-task, and lack good generalizability and explainability. In this study, we introduce MedViLaM, a unified vision-language model towards a generalist model for medical data that can flexibly encode and interpret various forms of medical data, including clinical language and imaging, all using the same set of model weights. To facilitate the creation of such a multi-task model, we have curated MultiMedBench, a comprehensive pretraining dataset and benchmark consisting of several distinct tasks, i.e., continuous question-answering, multi-label disease classification, disease localization, and generation and summarization of radiology reports. MedViLaM demonstrates strong performance across all MultiMedBench tasks, frequently outpacing other generalist models by a significant margin. Additionally, we present instances of zero-shot generalization to new medical concepts and tasks, effective transfer learning across different tasks, and the emergence of zero-shot medical reasoning.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper aims to address the issues of understanding and generating multimodal data in the medical field. Specifically, the paper attempts to solve the following problems:

1. **Multimodal and Multi-task Processing**:
   - Medical data is inherently multimodal and multi-task, including various forms of data such as text and images. However, most existing medical models are unimodal and single-task, lacking good generalization and interpretability.
2. **Model Generalization Ability**:
   - Existing deep learning models perform poorly on data from different medical centers, especially when there is a significant difference or domain shift between the training and testing data. This limits the effectiveness of these models in practical clinical applications.
3. **Model Interpretability**:
   - Deep learning models are often "black box" models, lacking the ability to explain their decision-making process. This opacity leads to low trust from doctors, who are accustomed to interpretable clinical reasoning.

### Solution

To address the above issues, the authors propose **MedViLaM** (Medical Vision-Language Model), a unified vision-language model with the following features:

1. **Multimodal and Multi-task Processing**:
   - MedViLaM can flexibly encode and interpret various forms of medical data, including clinical language and images, using the same set of model weights.
2. **Enhanced Generalization Ability**:
   - Through a carefully designed instruction tuning framework and diverse training strategies, MedViLaM can effectively extract relevant features and make predictions across various medical imaging tasks, demonstrating strong generalization ability.
3. **Enhanced Interpretability**:
   - Leveraging large language models, MedViLaM can improve the interpretability of diagnostic results through detailed disease descriptions and accurate annotation of lesion locations. For example, it can preliminarily verify the disease category, severity, and approximate location through classification tasks, provide more precise lesion bounding boxes through localization tasks, and assess the extent of the disease through segmentation functions.

### Experimental Results

The paper demonstrates the performance of MedViLaM on multiple medical benchmark datasets, including but not limited to:

- **Disease Classification and Localization**: On chest X-ray datasets, MedViLaM shows competitive performance in classification and localization tasks.
- **Video and Audio Analysis**: On endoscopy datasets, MedViLaM can accurately classify and locate abnormalities, polyps, instruments, etc.
- **Unseen Disease Diagnosis and Foreign Object Detection**: MedViLaM exhibits zero-shot generalization ability in tasks involving the diagnosis of unseen diseases and the detection of foreign objects in chest X-rays.

### Conclusion

Through joint training of multimodal and multi-task data, MedViLaM significantly improves the generalization and interpretability of medical data understanding and generation, providing potential support for future clinical applications.