How Well Do Multi-modal LLMs Interpret CT Scans? An Auto-Evaluation Framework for Analyses

Qingqing Zhu, Benjamin Hou, Tejas S. Mathai, Pritam Mukherjee, Qiao Jin, Xiuying Chen, Zhizheng Wang, Ruida Cheng, Ronald M. Summers, Zhiyong Lu

2024-06-18

Abstract: Automatically interpreting CT scans can ease the workload of radiologists. However, this is challenging mainly due to the scarcity of adequate datasets and reference standards for evaluation. This study aims to bridge this gap by introducing a novel evaluation framework, named "GPTRadScore". This framework assesses the capabilities of multi-modal LLMs, such as GPT-4 with Vision (GPT-4V), Gemini Pro Vision, LLaVA-Med, and RadFM, in generating descriptions for prospectively-identified findings. By employing a decomposition technique based on GPT-4, GPTRadScore compares these generated descriptions with gold-standard report sentences, analyzing their accuracy in terms of body part, location, and type of finding. Evaluations demonstrated a high correlation with clinician assessments and highlighted its potential over traditional metrics, such as BLEU, METEOR, and ROUGE. Furthermore, to contribute to future studies, we plan to release a benchmark dataset annotated by clinicians. Using GPTRadScore, we found that while GPT-4V and Gemini Pro Vision fare better, their performance revealed significant areas for improvement, primarily due to limitations in the dataset used for training these models. To demonstrate this potential, RadFM was fine-tuned and it resulted in significant accuracy improvements: location accuracy rose from 3.41% to 12.8%, body part accuracy from 29.12% to 53%, and type accuracy from 9.24% to 30%, thereby validating our hypothesis.
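The attribute-level scoring described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: in the paper, GPT-4 performs the decomposition and comparison, whereas here the findings are assumed to be already decomposed and a normalized exact string match stands in for the judgment. The names `Finding`, `attribute_matches`, and `attribute_accuracy` are hypothetical, chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass

# The three attributes named in the abstract.
ATTRIBUTES = ("body_part", "location", "finding_type")


@dataclass(frozen=True)
class Finding:
    """A finding already decomposed into its attributes (hypothetical structure)."""
    body_part: str
    location: str
    finding_type: str


def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting differences do not count.
    return " ".join(text.lower().split())


def attribute_matches(generated: Finding, reference: Finding) -> dict[str, bool]:
    """Compare one generated finding to its reference, attribute by attribute.

    Assumption: exact match after normalization replaces the GPT-4 comparison.
    """
    return {
        name: _normalize(getattr(generated, name)) == _normalize(getattr(reference, name))
        for name in ATTRIBUTES
    }


def attribute_accuracy(pairs: list[tuple[Finding, Finding]]) -> dict[str, float]:
    """Fraction of (generated, reference) pairs that agree on each attribute."""
    if not pairs:
        raise ValueError("at least one (generated, reference) pair is required")
    hits = {name: 0 for name in ATTRIBUTES}
    for generated, reference in pairs:
        for name, matched in attribute_matches(generated, reference).items():
            hits[name] += int(matched)
    return {name: hits[name] / len(pairs) for name in ATTRIBUTES}


# Made-up example pairs, not data from the paper.
pairs = [
    (Finding("liver", "right lobe", "cyst"), Finding("Liver", "left lobe", "cyst")),
    (Finding("kidney", "left", "mass"), Finding("kidney", "left", "cyst")),
]
scores = attribute_accuracy(pairs)
```

Reporting the three attributes separately, rather than as a single number, makes it visible when a model names the right organ but places or classifies the finding incorrectly.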

Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses how to evaluate multimodal large language models (LLMs) on the task of interpreting CT scans. Automatic interpretation of CT scans could reduce radiologists' workload, but progress is held back by a shortage of adequate datasets and reference standards for evaluation. To address this, the authors propose an evaluation framework called "GPTRadScore". GPTRadScore uses a GPT-4-based decomposition technique to compare model-generated descriptions of findings with gold-standard report sentences, scoring accuracy on three attributes: body part, location, and type of finding.

The evaluation shows that GPT-4V and Gemini Pro Vision fare better than the other models evaluated, but still leave significant room for improvement, which the authors attribute mainly to limitations in the data used to train these models. In support of this, fine-tuning RadFM raised its location accuracy from 3.41% to 12.8%, its body part accuracy from 29.12% to 53%, and its type accuracy from 9.24% to 30%.

The paper points out that although previous work has used LLMs to generate radiology reports, traditional metrics such as BLEU, METEOR, and ROUGE are limited in capturing the semantic richness and clinical relevance that radiology reports require. GPTRadScore correlates highly with clinician assessments, and the authors plan to release a clinician-annotated benchmark dataset to support future research.

The paper also mentions the potential of LLMs in tasks such as medical decision-making, information extraction, and disease diagnosis. Their clinical application, however, depends on radiologists' trust and on how easily generated content can be understood and evaluated. By simulating the clinical assessment process, GPTRadScore provides a more accurate evaluation of how multimodal LLMs perform in CT scan interpretation.
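One way to see why word-overlap metrics can miss clinically important errors is a small illustration. The sketch below computes a clipped unigram precision, a simplified stand-in for the kind of n-gram overlap that BLEU-style metrics build on; it is not BLEU, METEOR, or ROUGE as used in the paper. The function name `unigram_precision` and the two sentences are made up for this example.

```python
from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: share of candidate words also found in the reference.

    Each candidate word is credited at most as many times as it occurs in the reference.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count per word (the "clipping" step).
    overlap = sum((cand_counts & ref_counts).values())
    return overlap / total


# Illustrative sentences: the candidate gets the location wrong (left vs. right).
reference = "hypodense lesion in the right lobe of the liver"
candidate = "hypodense lesion in the left lobe of the liver"
score = unigram_precision(candidate, reference)  # 8 of 9 words overlap
```

Eight of the nine candidate words match, so the overlap score stays high even though the stated location is wrong. An attribute-level comparison of body part, location, and finding type would flag exactly that error.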

Towards a Holistic Framework for Multimodal Large Language Models in Three-dimensional Brain CT Report Generation

LLM-RadJudge: Achieving Radiologist-Level Evaluation for X-Ray Report Generation

Leveraging Professional Radiologists' Expertise to Enhance LLMs' Evaluation for Radiology Reports

3D-CT-GPT: Generating 3D Radiology Reports through Integration of Large Vision-Language Models

A Comprehensive Study of GPT-4V's Multimodal Capabilities in Medical Imaging

Exploring the Boundaries of GPT-4 in Radiology

OrthoDoc: Multimodal Large Language Model for Assisting Diagnosis in Computed Tomography

A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis

M4CXR: Exploring Multi-task Potentials of Multi-modal Large Language Models for Chest X-ray Interpretation

Assessing GPT-4 multimodal performance in radiological image analysis

MRScore: Evaluating Radiology Report Generation with LLM-based Reward System

Enhancing radiology training with GPT-4: Pilot analysis of automated feedback in trainee preliminary reports

Merlin: A Vision Language Foundation Model for 3D Computed Tomography

Comparative Analysis of GPT-4Vision, GPT-4 and Open Source LLMs in Clinical Diagnostic Accuracy: A Benchmark Against Human Expertise

Large Language Model with Region-guided Referring and Grounding for CT Report Generation

Step into the era of large multimodal models: a pilot study on ChatGPT-4V(ision)'s ability to interpret radiological images

Can GPT-4V(ision) Serve Medical Applications? Case Studies on GPT-4V for Multimodal Medical Diagnosis

Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports