How Well Do Multi-modal LLMs Interpret CT Scans? An Auto-Evaluation Framework for Analyses

Qingqing Zhu, Benjamin Hou, Tejas S. Mathai, Pritam Mukherjee, Qiao Jin, Xiuying Chen, Zhizheng Wang, Ruida Cheng, Ronald M. Summers, Zhiyong Lu
2024-06-18
Abstract: Automatically interpreting CT scans can ease the workload of radiologists. However, this is challenging mainly due to the scarcity of adequate datasets and reference standards for evaluation. This study aims to bridge this gap by introducing a novel evaluation framework, named "GPTRadScore". This framework assesses the capabilities of multi-modal LLMs, such as GPT-4 with Vision (GPT-4V), Gemini Pro Vision, LLaVA-Med, and RadFM, in generating descriptions for prospectively-identified findings. By employing a decomposition technique based on GPT-4, GPTRadScore compares these generated descriptions with gold-standard report sentences, analyzing their accuracy in terms of body part, location, and type of finding. Evaluations demonstrated a high correlation with clinician assessments and highlighted its potential over traditional metrics, such as BLEU, METEOR, and ROUGE. Furthermore, to contribute to future studies, we plan to release a benchmark dataset annotated by clinicians. Using GPTRadScore, we found that while GPT-4V and Gemini Pro Vision fare better, their performance revealed significant areas for improvement, primarily due to limitations in the dataset used for training these models. To demonstrate this potential, RadFM was fine-tuned and it resulted in significant accuracy improvements: location accuracy rose from 3.41% to 12.8%, body part accuracy from 29.12% to 53%, and type accuracy from 9.24% to 30%, thereby validating our hypothesis.
Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to evaluate multi-modal large language models (LLMs) on CT scan interpretation. Automatic interpretation of CT scans could reduce radiologists' workload, but progress is held back by a shortage of adequate datasets and reference standards for evaluation. Earlier work has used LLMs to generate radiology reports, yet the usual metrics (BLEU, METEOR, and ROUGE) are limited in capturing the semantic richness and clinical relevance that radiology reports require.
To close this gap, the authors propose an evaluation framework called "GPTRadScore". It uses a GPT-4-based decomposition technique to compare each model-generated description of a finding with the corresponding gold-standard report sentence, and it scores accuracy on three aspects: body part, location, and type of finding. Because this mirrors the clinical assessment process, the resulting scores correlate highly with clinician assessments. The authors also plan to release a clinician-annotated benchmark dataset to support future research.
Using GPTRadScore, the authors evaluate GPT-4V, Gemini Pro Vision, LLaVA-Med, and RadFM. GPT-4V and Gemini Pro Vision fare better, but their performance still leaves significant room for improvement, which the authors attribute mainly to limitations in the data used to train these models. Fine-tuning RadFM supports this view: location accuracy rose from 3.41% to 12.8%, body part accuracy from 29.12% to 53%, and type accuracy from 9.24% to 30%.
The paper also notes the potential of LLMs in tasks such as medical decision-making, information extraction, and disease diagnosis, while stressing that clinical use depends on radiologists' trust and on how easily the generated content can be understood and evaluated.
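The three-aspect comparison at the core of GPTRadScore (body part, location, and type of finding) can be illustrated with a minimal Python sketch. It is not the authors' implementation. In the paper, GPT-4 performs the decomposition and comparison. Here the descriptions are assumed to be already decomposed into three fields, and a normalized exact string match stands in for GPT-4's judgment. The names `Finding`, `aspect_matches`, and `aspect_accuracy`, and the example findings, are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass

# The three aspects the summary says GPTRadScore evaluates.
ASPECTS = ("body_part", "location", "finding_type")


@dataclass(frozen=True)
class Finding:
    """A finding description already decomposed into three fields.

    Assumption: in the paper this decomposition is produced by GPT-4;
    here the fields are simply supplied by hand.
    """
    body_part: str
    location: str
    finding_type: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def aspect_matches(generated: Finding, reference: Finding) -> dict[str, bool]:
    """Compare one generated finding with its reference, aspect by aspect.

    Assumption: exact match after normalization stands in for the
    GPT-4-based comparison described in the paper.
    """
    return {
        aspect: normalize(getattr(generated, aspect))
        == normalize(getattr(reference, aspect))
        for aspect in ASPECTS
    }


def aspect_accuracy(pairs: list[tuple[Finding, Finding]]) -> dict[str, float]:
    """Return per-aspect accuracy, in percent, over (generated, reference) pairs."""
    if not pairs:
        raise ValueError("at least one (generated, reference) pair is required")
    correct = {aspect: 0 for aspect in ASPECTS}
    for generated, reference in pairs:
        for aspect, matched in aspect_matches(generated, reference).items():
            correct[aspect] += int(matched)
    return {aspect: 100.0 * correct[aspect] / len(pairs) for aspect in ASPECTS}


# Made-up example pairs, for illustration only (not data from the paper).
pairs = [
    (Finding("liver", "right lobe", "cyst"), Finding("liver", "left lobe", "cyst")),
    (Finding("kidney", "left", "mass"), Finding("Kidney", "left", "stone")),
]
print(aspect_accuracy(pairs))
```

Reporting the three aspects separately, as the RadFM fine-tuning numbers do, shows where a model fails: a description can name the right organ while misplacing or mislabeling the finding, a distinction that a single overlap score would blur.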