Abstract: Medicine is inherently multimodal and multitask, with diverse data modalities spanning text and imaging. However, most models in the medical field are unimodal and single-task, and they lack good generalizability and explainability. In this study, we introduce MedViLaM, a unified vision-language model towards a generalist model for medical data that can flexibly encode and interpret various forms of medical data, including clinical language and imaging, all using the same set of model weights. To facilitate the creation of such a multi-task model, we have curated MultiMedBench, a comprehensive pretraining dataset and benchmark consisting of several distinct tasks, i.e., continuous question-answering, multi-label disease classification, disease localization, and generation and summarization of radiology reports. MedViLaM demonstrates strong performance across all MultiMedBench tasks, frequently outpacing other generalist models by a significant margin. Additionally, we present instances of zero-shot generalization to new medical concepts and tasks, effective transfer learning across different tasks, and the emergence of zero-shot medical reasoning.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address the issues of understanding and generating multimodal data in the medical field. Specifically, the paper attempts to solve the following problems:

1. **Multimodal and Multi-task Processing**:
   - Medical data is inherently multimodal and multi-task, including various forms of data such as text and images. However, most existing medical models are unimodal and single-task, lacking good generalization and interpretability.
2. **Model Generalization Ability**:
   - Existing deep learning models perform poorly on data from different medical centers, especially when there is a significant difference or domain shift between the training and testing data. This limits the effectiveness of these models in practical clinical applications.
3. **Model Interpretability**:
   - Deep learning models are often "black box" models, lacking the ability to explain their decision-making process. This opacity leads to low trust from doctors, who are accustomed to interpretable clinical reasoning.

### Solution

To address the above issues, the authors propose **MedViLaM** (Medical Vision-Language Model), a unified vision-language model with the following features:

1. **Multimodal and Multi-task Processing**:
   - MedViLaM can flexibly encode and interpret various forms of medical data, including clinical language and images, using the same set of model weights.
2. **Enhanced Generalization Ability**:
   - Through a carefully designed instruction tuning framework and diverse training strategies, MedViLaM can effectively extract relevant features and make predictions across various medical imaging tasks, demonstrating strong generalization ability.
3. **Enhanced Interpretability**:
   - Leveraging large language models, MedViLaM can improve the interpretability of diagnostic results through detailed disease descriptions and accurate annotation of lesion locations. For example, it can preliminarily verify the disease category, severity, and approximate location through classification tasks, provide more precise lesion bounding boxes through localization tasks, and assess the extent of the disease through segmentation functions.

### Experimental Results

The paper demonstrates the performance of MedViLaM on multiple medical benchmark datasets, including but not limited to:

- **Disease Classification and Localization**: On chest X-ray datasets, MedViLaM shows competitive performance in classification and localization tasks.
- **Video and Audio Analysis**: On endoscopy datasets, MedViLaM can accurately classify and locate abnormalities, polyps, instruments, etc.
- **Unseen Disease Diagnosis and Foreign Object Detection**: MedViLaM exhibits zero-shot generalization ability in tasks involving the diagnosis of unseen diseases and the detection of foreign objects in chest X-rays.

### Conclusion

Through joint training of multimodal and multi-task data, MedViLaM significantly improves the generalization and interpretability of medical data understanding and generation, providing potential support for future clinical applications.

MedViLaM: A multimodal large language model with advanced generalizability and explainability for medical data understanding and generation

Interpretable Bilingual Multimodal Large Language Model for Diverse Biomedical Tasks

VividMed: Vision Language Model with Versatile Visual Grounding for Medicine

Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General Healthcare

Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training

Advancing High Resolution Vision-Language Models in Biomedicine

MOSS-MED: Medical Multimodal Model Serving Medical Image Analysis

A Generalist Learner for Multifaceted Medical Image Interpretation

GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI

Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models

LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound

Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE

SemiHVision: Enhancing Medical Multimodal Models with a Semi-Human Annotated Dataset and Fine-Tuned Instruction Generation

MedXChat: A Unified Multimodal Large Language Model Framework towards CXRs Understanding and Generation

Towards Evaluating and Building Versatile Large Language Models for Medicine

MMedAgent: Learning to Use Medical Tools with Multi-modal Agent

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

Med-2E3: A 2D-Enhanced 3D Medical Multimodal Large Language Model

OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM

Multimodal Large Language Models are Generalist Medical Image Interpreters