MAIRA-1: A specialised large multimodal model for radiology report generation

Stephanie L. Hyland, Shruthi Bannur, Kenza Bouzid, Daniel C. Castro, Mercy Ranjit, Anton Schwaighofer, Fernando Pérez-García, Valentina Salvatelli, Shaury Srivastav, Anja Thieme, Noel Codella, Matthew P. Lungren, Maria Teodora Wetscherek, Ozan Oktay, Javier Alvarez-Valle
2024-04-27
Abstract: We present a radiology-specific multimodal model for the task of generating radiological reports from chest X-rays (CXRs). Our work builds on the idea that large language models can be equipped with multimodal capabilities through alignment with pre-trained vision encoders. On natural images, this has been shown to allow multimodal models to gain image understanding and description capabilities. Our proposed model (MAIRA-1) leverages a CXR-specific image encoder in conjunction with a fine-tuned large language model based on Vicuna-7B, and text-based data augmentation, to produce reports with state-of-the-art quality. In particular, MAIRA-1 significantly improves on the radiologist-aligned RadCliQ metric and across all lexical metrics considered. Manual review of model outputs demonstrates promising fluency and accuracy of generated reports while uncovering failure modes not captured by existing evaluation practices. More information and resources can be found on the project website: https://aka.ms/maira.
Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the generation of high-quality "Findings" sections for chest X-ray (CXR) radiology reports. The authors propose MAIRA-1, a radiology-specific multimodal model that generates the "Findings" section of a report from a single chest X-ray and the indication for the examination. Unlike general image description tasks, radiology reporting requires detailed descriptions of both abnormal and normal findings, many of which correspond to very subtle structural changes in the image. Generating high-quality radiology reports is therefore a challenging multimodal task.

The main contributions of the paper include:

1. **Model Architecture**: MAIRA-1 combines a pre-trained radiology-specific image encoder (RAD-DINO) with a fine-tuned large language model (Vicuna-7B), and improves the quality of the generated reports through text data augmentation.
2. **Performance Evaluation**: The model is evaluated with both lexical metrics and radiology-specific metrics, and shows significant improvement on the radiologist-aligned RadCliQ metric and on all lexical metrics considered.
3. **Impact of Design Choices**: The paper explores how different design choices (such as using a domain-specific image encoder, increasing the size of adapters, and using GPT-enhanced data) affect the model's performance.

Overall, the paper aims to improve the accuracy and fluency of generated radiology reports with a model built specifically for radiology, with the goal of improving and accelerating the radiology reporting workflow.
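The architecture described above (an image encoder, an adapter, and a language model) can be illustrated with a toy sketch. This is not the authors' implementation: it is a minimal, standard-library-only illustration of the general idea of projecting image patch features into a language model's embedding space and combining them with text embeddings. All dimensions, the ReLU activation, the two-layer adapter depth, the placement of image tokens before the text, and every function name here are assumptions made for illustration.

```python
import random

# Toy sizes, chosen only for illustration (hypothetical, not from the paper).
IMAGE_DIM = 8     # size of each patch feature from the image encoder
HIDDEN_DIM = 16   # hidden width of the adapter MLP
LLM_DIM = 12      # embedding size expected by the language model
NUM_PATCHES = 4   # number of image patches


def make_layer(n_in, n_out, rng):
    """Create random weights (shape [n_out][n_in]) and a zero bias."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def linear(x, weights, bias):
    """Apply y = W x + b to a single feature vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    """Element-wise ReLU (assumed activation for this sketch)."""
    return [max(0.0, v) for v in x]


def adapter(patch_features, layers):
    """Project each patch feature into the language model's embedding space."""
    projected = []
    for x in patch_features:
        for i, (weights, bias) in enumerate(layers):
            x = linear(x, weights, bias)
            if i < len(layers) - 1:  # no activation after the final layer
                x = relu(x)
        projected.append(x)
    return projected


def build_input_sequence(image_tokens, text_embeddings):
    """Place projected image tokens before the text embeddings (assumed order)."""
    return image_tokens + text_embeddings


rng = random.Random(0)
layers = [make_layer(IMAGE_DIM, HIDDEN_DIM, rng), make_layer(HIDDEN_DIM, LLM_DIM, rng)]

# Stand-ins for the image encoder output and the embedded indication text.
patch_features = [[rng.random() for _ in range(IMAGE_DIM)] for _ in range(NUM_PATCHES)]
text_embeddings = [[0.0] * LLM_DIM for _ in range(5)]

image_tokens = adapter(patch_features, layers)
sequence = build_input_sequence(image_tokens, text_embeddings)
print(len(sequence), len(sequence[0]))  # 9 12
```

In this sketch, "increasing the size of adapters" would correspond to adding more entries to `layers` or widening `HIDDEN_DIM`; the language model itself is omitted and would consume `sequence` to generate the "Findings" text.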