MedXChat: A Unified Multimodal Large Language Model Framework towards CXRs Understanding and Generation

Ling Yang, Zhanyu Wang, Zhenghao Chen, Xinyu Liang, Luping Zhou

2024-05-10

Abstract: Multimodal Large Language Models (MLLMs) have shown success in various general image processing tasks, yet their application in medical imaging is nascent and lacks tailored models. This study investigates the potential of MLLMs to improve the understanding and generation of Chest X-Rays (CXRs). We introduce MedXChat, a unified framework facilitating seamless interactions between medical assistants and users for diverse CXR tasks, including text report generation, visual question-answering (VQA), and Text-to-CXR generation. Because our MLLM takes natural language as input, it breaks task boundaries, simplifying medical professional training by allowing diverse tasks within a single environment. For CXR understanding, we leverage powerful off-the-shelf visual encoders (e.g., ViT) and LLMs (e.g., mPLUG-Owl) to convert medical imagery into language-like features, and subsequently fine-tune our large pre-trained models for medical applications using a visual adapter network and a delta-tuning approach. For CXR generation, we introduce an innovative synthesis approach that utilizes instruction-following capabilities within the Stable Diffusion (SD) architecture. This technique integrates smoothly with the existing model framework and requires no extra parameters, thereby maintaining SD's generative strength while also giving it the capacity to render fine-grained medical images with high fidelity. Through comprehensive experiments, our model demonstrates exceptional cross-task adaptability, performing well across all three defined tasks. Our MedXChat model and the instruction dataset used in this research will be made publicly available to encourage further exploration in the field.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the lack of multimodal large language models (MLLMs) tailored to medical imaging. MLLMs perform well on general image tasks, but the medical field lacks models capable of handling diverse medical imaging tasks. To fill this gap, the paper proposes MedXChat, a unified MLLM framework aimed at seamless interaction between medical assistants and users. MedXChat covers three specific tasks:

1. **CXR-to-Report Generation**: Generating text reports from chest X-rays (CXRs).
2. **CXR-based Visual Question Answering (VQA)**: Answering users' questions about a given chest X-ray.
3. **Text-to-CXR Synthesis**: Generating chest X-rays from textual descriptions.

Across these tasks, MedXChat demonstrates cross-task adaptability and surpasses benchmark models in medical multimodal applications on the MIMIC dataset. The paper also introduces a Text-to-CXR synthesis method that uses the instruction-following capability of the Stable Diffusion architecture to generate high-fidelity, fine-grained medical images without additional parameters.
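As a rough illustration of how a single natural-language interface can cover all three tasks, the sketch below wraps each task as an instruction record. It is not taken from the paper: the task keys, the `Instruction` class, the `build_instruction` helper, and the prompt wording are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instruction:
    """One request to a unified CXR assistant (hypothetical structure)."""
    task: str          # "report", "vqa", or "text_to_cxr" (assumed names)
    prompt: str        # natural-language instruction given to the model
    needs_image: bool  # True when a CXR image accompanies the prompt


def build_instruction(task: str, text: Optional[str] = None) -> Instruction:
    """Express one of the three CXR tasks as a natural-language instruction."""
    if task == "report":
        # CXR-to-report: the image is the input, no extra text is required.
        return Instruction(task, "Write a radiology report for this chest X-ray.", True)
    if task == "vqa":
        # CXR-based VQA: the image plus a user question.
        if not text:
            raise ValueError("VQA needs a question")
        return Instruction(task, f"Look at this chest X-ray and answer: {text}", True)
    if task == "text_to_cxr":
        # Text-to-CXR: a description only, the image is the output.
        if not text:
            raise ValueError("Text-to-CXR needs a description")
        return Instruction(task, f"Generate a chest X-ray showing: {text}", False)
    raise ValueError(f"unknown task: {task}")
```

Under these assumptions, `build_instruction("vqa", "Is there a pleural effusion?")` yields an image-plus-text request, while `build_instruction("text_to_cxr", "mild cardiomegaly")` yields a text-only request whose result would be a synthesized image.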

TCMChat: A Generative Large Language Model for Traditional Chinese Medicine

M4CXR: Exploring Multi-task Potentials of Multi-modal Large Language Models for Chest X-ray Interpretation

CXR-LLAVA: a multimodal large language model for interpreting chest X-ray images

ECG-Chat: A Large ECG-Language Model for Cardiac Disease Diagnosis

Interpretable Bilingual Multimodal Large Language Model for Diverse Biomedical Tasks

LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and Generation

Med-2E3: A 2D-Enhanced 3D Medical Multimodal Large Language Model

XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models

ChatCAD: Interactive Computer-Aided Diagnosis on Medical Image using Large Language Models

DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention

A Refer-and-Ground Multimodal Large Language Model for Biomedicine

Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General Healthcare

Inquire, Interact, and Integrate: A Proactive Agent Collaborative Framework for Zero-Shot Multimodal Medical Reasoning

Med-PMC: Medical Personalized Multi-modal Consultation with a Proactive Ask-First-Observe-Next Paradigm

M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models

R2GenCSR: Retrieving Context Samples for Large Language Model based X-ray Medical Report Generation

MedViLaM: A multimodal large language model with advanced generalizability and explainability for medical data understanding and generation

Enhancing Clinical Accuracy of Medical Chatbots with Large Language Models

ChatCAD+: Towards a Universal and Reliable Interactive CAD using LLMs

RoentGen: Vision-Language Foundation Model for Chest X-ray Generation