Abstract: Multi-modal large language models (MLLMs) have been used to explore a range of medical applications, with a primary focus on radiology report generation. Nevertheless, the preliminary success in 2D radiology captioning does not reflect the real-world diagnostic challenge of volumetric 3D anatomy. To address three crucial limitations in the existing literature, namely (1) data complexity, (2) model capacity, and (3) evaluation metric fidelity, we collected a 3D-BrainCT dataset of 18,885 text-scan pairs and applied clinical visual instruction tuning (CVIT) to train BrainGPT models to generate radiology-adherent 3D brain CT reports. Quantitatively, BrainGPT scored BLEU-1 = 44.35, BLEU-4 = 20.38, METEOR = 30.13, ROUGE-L = 47.6, and CIDEr-R = 211.77 in internal testing and achieved an accuracy of 0.91 in captioning midline shifts on the external validation CQ500 dataset. On further inspection of the generated reports, we found that the traditional metrics appeared to measure only surface text similarity and failed to gauge the density of diagnostically relevant information. To close this gap, we proposed a novel Feature-Oriented Radiology Task Evaluation (FORTE) to estimate the clinical relevance of a report (lesion features and landmarks). Notably, the BrainGPT model scored an average FORTE F1-score of 0.71 (degree = 0.661; landmark = 0.706; feature = 0.693; impression = 0.779). To assess whether BrainGPT models can generate human-like radiology reports, we conducted a Turing test that enrolled 11 physician evaluators, and around 74% of the BrainGPT-generated captions were indistinguishable from those written by humans. Our work presents a holistic framework that covers curating a 3D brain CT dataset, fine-tuning anatomy-sensible language models, and proposing robust radiology evaluation metrics.
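
As a quick arithmetic check on the reported average FORTE F1-score, the sketch below takes the four per-category values from the abstract and computes their unweighted mean. That the average is an unweighted (macro) mean is an assumption made here for illustration; the abstract does not state how the categories were combined. The names `forte_f1` and `macro_average` are hypothetical.

```python
# Per-category FORTE F1-scores as reported in the abstract.
forte_f1 = {
    "degree": 0.661,
    "landmark": 0.706,
    "feature": 0.693,
    "impression": 0.779,
}


def macro_average(scores):
    """Unweighted mean of the per-category scores.

    Assumption: the abstract's "average" weights all four categories equally.
    """
    return sum(scores.values()) / len(scores)


average_f1 = macro_average(forte_f1)
print(f"{average_f1:.2f}")
```

Under this assumption the mean is 0.70975, which rounds to the 0.71 stated in the abstract.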

ViT3D Alignment of LLaMA3: 3D Medical Image Report Generation

3D-CT-GPT: Generating 3D Radiology Reports through Integration of Large Vision-Language Models

Towards a Holistic Framework for Multimodal Large Language Models in Three-dimensional Brain CT Report Generation

Harnessing the Power of Pre-trained Vision-Language Models for Efficient Medical Report Generation

M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models

VividMed: Vision Language Model with Versatile Visual Grounding for Medicine

Dia-LLaMA: Towards Large Language Model-driven CT Report Generation

Med-2E3: A 2D-Enhanced 3D Medical Multimodal Large Language Model

Automatic Medical Report Generation Based on Cross-View Attention and Visual-Semantic Long Short Term Memorys

MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models

Gla-AI4BioMed at RRG24: Visual Instruction-tuned Adaptation for Radiology Report Generation

ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue

LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound

Customizing General-Purpose Foundation Models for Medical Report Generation

GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI

Anatomical Structure-Guided Medical Vision-Language Pre-training

LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model

MOSS-MED: Medical Multimodal Model Serving Medical Image Analysis

Resource-Efficient Medical Report Generation using Large Language Models

AlignTransformer: Hierarchical Alignment of Visual Regions and Disease Tags for Medical Report Generation

MRScore: Evaluating Radiology Report Generation with LLM-based Reward System