Towards a Holistic Framework for Multimodal Large Language Models in Three-dimensional Brain CT Report Generation

Cheng-Yi Li, Kao-Jung Chang, Cheng-Fu Yang, Hsin-Yu Wu, Wenting Chen, Hritik Bansal, Ling Chen, Yi-Ping Yang, Yu-Chun Chen, Shih-Pin Chen, Jiing-Feng Lirng, Kai-Wei Chang, Shih-Hwa Chiou
2024-07-02
Abstract: Multi-modal large language models (MLLMs) have been given free rein to explore exciting medical applications, with a primary focus on radiology report generation. Nevertheless, the preliminary success in 2D radiology captioning does not reflect the real-world diagnostic challenge of volumetric 3D anatomy. To address three crucial limitations in the existing literature, namely (1) data complexity, (2) model capacity, and (3) evaluation metric fidelity, we collected a 3D-BrainCT dataset of 18,885 text-scan pairs and applied clinical visual instruction tuning (CVIT) to train BrainGPT models to generate radiology-adherent 3D brain CT reports. Statistically, our BrainGPT scored BLEU-1 = 44.35, BLEU-4 = 20.38, METEOR = 30.13, ROUGE-L = 47.6, and CIDEr-R = 211.77 during internal testing and demonstrated an accuracy of 0.91 in captioning midline shifts on the external validation CQ500 dataset. On further inspection of the captioned reports, we found that the traditional metrics appeared to measure only surface text similarity and failed to gauge the diagnostic information density of a report. To close this gap, we proposed a novel Feature-Oriented Radiology Task Evaluation (FORTE) to estimate the report's clinical relevance (lesion features and landmarks). Notably, the BrainGPT model scored an average FORTE F1-score of 0.71 (degree = 0.661; landmark = 0.706; feature = 0.693; impression = 0.779). To demonstrate that BrainGPT models are objectively ready to generate human-like radiology reports, we conducted a Turing test that enrolled 11 physician evaluators, and around 74% of the BrainGPT-generated captions were indistinguishable from those written by humans. Our work embodies a holistic framework that showcases first-hand experience in curating a 3D brain CT dataset, fine-tuning anatomy-sensible language models, and proposing robust radiology evaluation metrics.
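The reported average FORTE F1-score can be checked against the four category scores quoted in the abstract. The short sketch below assumes the average is an unweighted mean over the categories, which the abstract does not state explicitly; the variable names are illustrative.

```python
# Category-level FORTE F1 scores as quoted in the abstract.
forte_f1 = {
    "degree": 0.661,
    "landmark": 0.706,
    "feature": 0.693,
    "impression": 0.779,
}

# Assumption: the reported average is an unweighted (macro) mean
# over the four categories.
average_f1 = sum(forte_f1.values()) / len(forte_f1)

print(f"Average FORTE F1: {average_f1:.2f}")
```

Under that assumption the mean is 0.70975, which rounds to the reported 0.71.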
Computation and Language
What problem does this paper attempt to address?
The paper attempts to address three issues:

1. **Data complexity in 3D brain CT report generation**: Existing 2D medical image datasets cannot capture the information carried by complex 3D anatomy, such as pathological features, spatial landmarks, and lesion extent in the brain, heart, and eyes. A 3D image dataset containing this information is therefore needed to train multimodal large language models (MLLMs).
2. **Insufficient model capacity**: Foundation MLLMs used directly, without specialization, perform poorly on 3D images, especially at identifying lesion locations in specific slices. The model therefore requires specialized fine-tuning to improve its performance on 3D images.
3. **Inaccurate evaluation metrics**: Traditional metrics (such as BLEU, METEOR, and ROUGE-L) were designed mainly for short-text translation, summarization, and general image captioning, and cannot measure the clinical relevance of 3D brain CT reports. New metrics are therefore needed to assess the clinical value of the generated reports.

To address these issues, the authors propose the following methods:

- **Constructing the 3D-BrainCT dataset**: Collected a dataset of 18,885 text-scan pairs containing detailed information on lesion extent, spatial landmarks, and diagnostic impressions.
- **Applying Clinical Visual Instruction Tuning (CVIT)**: Starting from the open-source Otter model, trained BrainGPT models with different instruction tuning methods (general instructions, in-context example instructions, template instructions, and keyword instructions) so that they generate 3D brain CT reports that meet clinical needs.
- **Proposing Feature-Oriented Radiology Task Evaluation (FORTE)**: Designed a new evaluation metric that assesses the clinical relevance of generated reports in four categories: degree (lesion extent), landmark (spatial markers), feature, and impression.
- **Conducting a Turing test**: Tested whether reports generated by BrainGPT can be distinguished from human-written reports, using evaluations by 11 physicians.

With these methods, the authors demonstrate the potential of the BrainGPT model for generating 3D brain CT reports and propose a comprehensive framework that serves as a reference for future research.
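To make the idea behind a feature-oriented metric more concrete, here is a minimal sketch of a per-category keyword F1 score. This summary does not describe how FORTE is actually computed, so the sketch assumes simple keyword matching per category. The `KEYWORDS` vocabulary, the function names, and the two example reports are all hypothetical and are not taken from the paper.

```python
import re

# Hypothetical keyword vocabulary per category. Illustrative only;
# these are NOT the keyword lists used by the paper's FORTE metric.
KEYWORDS = {
    "degree": {"mild", "moderate", "severe"},
    "landmark": {"frontal", "temporal", "ventricle", "midline"},
    "feature": {"hemorrhage", "infarct", "atrophy", "shift"},
    "impression": {"stroke", "hematoma", "normal"},
}


def tokens(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def category_f1(generated, reference, vocabulary):
    """F1 over the category keywords found in each report."""
    gen = tokens(generated) & vocabulary
    ref = tokens(reference) & vocabulary
    if not gen and not ref:
        return None  # category absent from both reports
    true_positives = len(gen & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(gen)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


def forte_like_scores(generated, reference):
    """Score every category; assumed scheme, not the published one."""
    return {
        name: category_f1(generated, reference, vocab)
        for name, vocab in KEYWORDS.items()
    }


# Made-up example reports for illustration.
reference = (
    "Mild atrophy. Severe hemorrhage in the left frontal lobe "
    "with midline shift. Impression: hematoma."
)
generated = (
    "There is severe hemorrhage at the frontal region causing "
    "shift of the midline. Hematoma is suspected."
)

scores = forte_like_scores(generated, reference)
for name, value in scores.items():
    print(name, value)
```

In this toy example the generated report is worded quite differently from the reference, yet it still scores 1.0 on landmark and impression keywords and loses points only for omitting "mild" and "atrophy". That is the kind of clinical content a surface-overlap metric would not isolate.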