MedXChat: A Unified Multimodal Large Language Model Framework towards CXRs Understanding and Generation

Ling Yang, Zhanyu Wang, Zhenghao Chen, Xinyu Liang, Luping Zhou
2024-05-10
Abstract: Multimodal Large Language Models (MLLMs) have shown success in various general image processing tasks, yet their application in medical imaging is nascent and lacks tailored models. This study investigates the potential of MLLMs to improve the understanding and generation of Chest X-Rays (CXRs). We introduce MedXChat, a unified framework facilitating seamless interactions between medical assistants and users for diverse CXR tasks, including text report generation, visual question-answering (VQA), and Text-to-CXR generation. Taking natural language as input, our MLLM breaks task boundaries, maximally simplifying medical professional training by allowing diverse tasks to be performed within a single environment. For CXR understanding, we leverage powerful off-the-shelf visual encoders (e.g., ViT) and LLMs (e.g., mPLUG-Owl) to convert medical imagery into language-like features, and subsequently fine-tune our large pre-trained models for medical applications using a visual adapter network and a delta-tuning approach. For CXR generation, we introduce an innovative synthesis approach that utilizes instruction-following capabilities within the Stable Diffusion (SD) architecture. This technique integrates smoothly with the existing model framework and requires no extra parameters, thereby maintaining SD's generative strength while also giving it the capacity to render fine-grained medical images with high fidelity. In comprehensive experiments, our model demonstrates exceptional cross-task adaptability, performing well across all three defined tasks. Our MedXChat model and the instruction dataset used in this research will be made publicly available to encourage further exploration in the field.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the lack of multimodal large language models (MLLMs) tailored to medical imaging. Existing MLLMs perform well on general image tasks, but the medical field lacks models that can handle the diversity of medical imaging tasks. To close this gap, the paper proposes MedXChat, a unified multimodal large model that aims for seamless interaction between medical assistants and users. MedXChat covers three specific tasks:

1. **CXR-to-Report Generation**: Generating reports from chest X-rays (CXRs).
2. **CXR-based Visual Question Answering (VQA)**: Answering users' questions about a given chest X-ray.
3. **Text-to-CXR Synthesis**: Generating chest X-rays from textual descriptions.

Across these tasks, MedXChat demonstrates cross-task adaptability and surpasses benchmark models in medical multimodal applications on the MIMIC dataset. The paper also introduces an innovative Text-to-CXR synthesis method that uses the instruction-following capability of the Stable Diffusion architecture to generate high-fidelity, fine-grained medical images without additional parameters.
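
The abstract mentions fine-tuning with "a delta-tuning approach" but does not say which variant is used. As a rough illustration only, the sketch below shows one common form of delta-tuning, a low-rank (LoRA-style) update: the pre-trained weight matrix `W` stays frozen and only two small matrices `A` and `B` would be trained. The function names, shapes, and values are hypothetical and are not taken from the paper.

```python
# Illustrative only: a LoRA-style low-rank update as ONE example of delta-tuning.
# Assumption: the paper's actual delta-tuning variant is not stated in this text.
# All names, shapes, and numbers below are hypothetical.

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def adapted_forward(W, A, B, x, scale=1.0):
    """Compute W x + scale * B (A x). W is frozen; only A and B would be trained."""
    base = matvec(W, x)               # output of the frozen pre-trained layer
    delta = matvec(B, matvec(A, x))   # low-rank correction, rank = number of rows in A
    return [b + scale * d for b, d in zip(base, delta)]

def count_params(M):
    """Number of scalar entries in matrix M."""
    return sum(len(row) for row in M)

# Frozen 4x4 "pre-trained" weight matrix.
W = [[1.0, 0.0, 2.0, 0.0],
     [0.0, 1.0, 0.0, 2.0],
     [1.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 1.0]]

# Rank-1 adapter: A is 1x4, B is 4x1. B starts at zero, so the adapted
# layer initially reproduces the frozen layer exactly.
A = [[0.5, 0.5, 0.5, 0.5]]
B = [[0.0], [0.0], [0.0], [0.0]]

x = [1.0, 2.0, 3.0, 4.0]
frozen_out = matvec(W, x)
tuned_out = adapted_forward(W, A, B, x)

trainable = count_params(A) + count_params(B)  # 8 adapter entries
frozen = count_params(W)                       # 16 frozen entries
```

With `B` initialized to zero, `tuned_out` equals `frozen_out`, so training starts from the pre-trained behaviour, and only the entries of `A` and `B` would receive gradient updates.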