Abstract: Multi-modal large language models (MLLMs) have been given free rein to explore exciting medical applications, with a primary focus on radiology report generation. Nevertheless, the preliminary success in 2D radiology captioning does not reflect the real-world diagnostic challenge of volumetric 3D anatomy. To mitigate three crucial limitations in the existing literature, namely (1) data complexity, (2) model capacity, and (3) evaluation metric fidelity, we collected a 3D-BrainCT dataset of 18,885 text-scan pairs and applied clinical visual instruction tuning (CVIT) to train BrainGPT models to generate radiology-adherent 3D brain CT reports. Statistically, our BrainGPT scored BLEU-1 = 44.35, BLEU-4 = 20.38, METEOR = 30.13, ROUGE-L = 47.6, and CIDEr-R = 211.77 during internal testing and demonstrated an accuracy of 0.91 in captioning midline shifts on the external validation CQ500 dataset. On further inspection of the captioned reports, we found that the traditional metrics appeared to measure only surface text similarity and failed to gauge the information density needed for diagnostic purposes. To close this gap, we proposed a novel Feature-Oriented Radiology Task Evaluation (FORTE) to estimate a report's clinical relevance (lesion features and landmarks). Notably, the BrainGPT model scored an average FORTE F1-score of 0.71 (degree = 0.661; landmark = 0.706; feature = 0.693; impression = 0.779). To demonstrate that BrainGPT models are objectively ready to generate human-like radiology reports, we conducted a Turing test that enrolled 11 physician evaluators, and around 74% of the BrainGPT-generated captions were indistinguishable from those written by humans. Our work embodies a holistic framework that showcases first-hand experience of curating a 3D brain CT dataset, fine-tuning anatomy-sensible language models, and proposing robust radiology evaluation metrics.
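As a quick consistency check on the figures above, the reported average FORTE F1-score can be reproduced from the four category scores. The sketch below assumes the average is an unweighted mean over the four categories; the abstract does not state the weighting, so this is an assumption rather than the authors' stated procedure.

```python
# Category-level FORTE F1-scores as reported in the abstract.
forte_f1 = {
    "degree": 0.661,
    "landmark": 0.706,
    "feature": 0.693,
    "impression": 0.779,
}

# Assumption: the average is a simple unweighted mean of the four categories.
average_f1 = sum(forte_f1.values()) / len(forte_f1)

print(f"Average FORTE F1: {average_f1:.2f}")
```

Under that assumption the mean is 0.70975, which rounds to the reported 0.71.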

What problem does this paper attempt to address?

The paper attempts to address the following issues:

1. **Data complexity in 3D brain CT report generation**: Existing 2D medical image datasets cannot capture complex 3D anatomical information, such as pathological features, spatial markers (landmarks), and lesion extent in the brain, heart, and eyes. A 3D image dataset containing this information is therefore needed to train multimodal large language models (MLLMs).

2. **Insufficient model capacity**: General-purpose foundation MLLMs, used directly without adaptation, perform poorly on 3D images, especially when identifying lesion locations in specific slices. Specialized fine-tuning is therefore required to improve performance on 3D images.

3. **Inaccuracy of evaluation metrics**: Traditional evaluation metrics (such as BLEU, METEOR, and ROUGE-L) were designed mainly for short-text translation, summarization, and general image captioning, and cannot measure the clinical relevance of 3D brain CT reports. New evaluation metrics are therefore needed to assess the clinical value of the generated reports.

To address these issues, the authors propose the following methods:

- **Constructing the 3D-BrainCT dataset**: Collected a dataset of 18,885 text-scan pairs containing detailed information on lesion extent, spatial markers, and diagnostic impressions.

- **Applying Clinical Visual Instruction Tuning (CVIT)**: Starting from the open-source Otter model, trained the BrainGPT models with different instruction tuning methods (general instructions, in-context example instructions, template instructions, and keyword instructions) so that they generate 3D brain CT reports that meet clinical needs.

- **Proposing Feature-Oriented Radiology Task Evaluation (FORTE)**: Designed a new evaluation metric that assesses the clinical relevance of the generated reports along four aspects: lesion extent (degree), spatial markers (landmarks), features, and impressions.

- **Conducting a Turing test**: Had 11 physicians evaluate whether reports generated by BrainGPT can be distinguished from human-written reports.

Through these methods, the authors demonstrate the potential of the BrainGPT models for generating 3D brain CT reports and propose a comprehensive framework that provides a reference for future research.
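The contrast between surface-similarity metrics and a feature-oriented metric can be illustrated with a minimal sketch. This is not the authors' FORTE implementation: the keyword vocabularies, the substring matching, the function names, and the convention that two empty keyword sets score 1.0 are all assumptions made for illustration. The surface metric here is clipped unigram precision, the core of BLEU-1 without the brevity penalty.

```python
import re
from collections import Counter

# Hypothetical keyword vocabularies, for illustration only; the paper's actual
# FORTE keyword lists are not reproduced in this summary.
CATEGORY_KEYWORDS = {
    "degree": {"mild", "moderate", "severe"},
    "landmark": {"frontal", "temporal", "ventricle", "midline"},
    "feature": {"hemorrhage", "infarct", "atrophy"},
    "impression": {"stroke", "hematoma"},
}


def extract_keywords(report, vocabulary):
    """Return vocabulary terms found in the report (case-insensitive substring match)."""
    text = report.lower()
    return {keyword for keyword in vocabulary if keyword in text}


def keyword_f1(predicted, reference):
    """F1 between two keyword sets; two empty sets score 1.0 (assumed convention)."""
    if not predicted and not reference:
        return 1.0
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def category_f1_scores(generated, reference):
    """Compute one keyword F1-score per category."""
    return {
        category: keyword_f1(
            extract_keywords(generated, vocabulary),
            extract_keywords(reference, vocabulary),
        )
        for category, vocabulary in CATEGORY_KEYWORDS.items()
    }


def unigram_precision(candidate, reference):
    """Clipped unigram precision: share of candidate tokens also present in the reference."""
    candidate_tokens = re.findall(r"[a-z0-9]+", candidate.lower())
    if not candidate_tokens:
        return 0.0
    reference_counts = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    candidate_counts = Counter(candidate_tokens)
    overlap = sum(
        min(count, reference_counts[token])
        for token, count in candidate_counts.items()
    )
    return overlap / len(candidate_tokens)


# Made-up example sentences, not taken from the 3D-BrainCT dataset.
reference_report = "Acute hemorrhage in the left frontal lobe with mild midline shift."
generated_report = "Old infarct in the left frontal lobe with mild midline shift."

scores = category_f1_scores(generated_report, reference_report)
surface = unigram_precision(generated_report, reference_report)
print(scores)
print(round(surface, 3))
```

In this made-up pair, 9 of the 11 generated tokens also appear in the reference, so the surface score is high, yet the feature category scores 0 because the generated report names an infarct where the reference names a hemorrhage. That is the kind of gap the summary attributes to traditional metrics.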

Towards a Holistic Framework for Multimodal Large Language Models in Three-dimensional Brain CT Report Generation

3D-CT-GPT: Generating 3D Radiology Reports through Integration of Large Vision-Language Models

How Well Do Multi-modal LLMs Interpret CT Scans? An Auto-Evaluation Framework for Analyses

Large Language Model with Region-guided Referring and Grounding for CT Report Generation

Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports

Leveraging Multimodal Models for Enhanced Neuroimaging Diagnostics in Alzheimer's Disease

TRRG: Towards Truthful Radiology Report Generation With Cross-modal Disease Clue Enhanced Large Language Model

Simple Words over Rich Imaging: Accurate Brain Disease Classification via Language Model Analysis of Radiological Reports

Mouse embryos' fusion for the tetraploid complementation assay.

Developing Generalist Foundation Models from a Multimodal Dataset for 3D Computed Tomography

See Detail Say Clear: Towards Brain CT Report Generation via Pathological Clue-driven Representation Learning

OrthoDoc: Multimodal Large Language Model for Assisting Diagnosis in Computed Tomography

Effectively Fine-tune to Improve Large Multimodal Models for Radiology Report Generation

Multi-modal large language models in radiology: principles, applications, and potential

PeFoMed: Parameter Efficient Fine-tuning of Multimodal Large Language Models for Medical Imaging

Evaluating Large Language Models for Radiology Natural Language Processing

Automatically Generating Narrative-Style Radiology Reports from Volumetric CT Images; a Proof of Concept

Exploring Multimodal Large Language Models for Radiology Report Error-checking

ChatRadio-Valuer: A Chat Large Language Model for Generalizable Radiology Report Generation Based on Multi-institution and Multi-system Data

M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models

Dia-LLaMA: Towards Large Language Model-driven CT Report Generation