Evaluating GPT-4V (GPT-4 with Vision) on Detection of Radiologic Findings on Chest Radiographs

Yiliang Zhou, Hanley Ong, Patrick Kennedy, Carol C. Wu, Jacob Kazam, Keith Hentel, Adam Flanders, George Shih, Yifan Peng, Sarah Atzen
DOI: https://doi.org/10.1148/radiol.233270
IF: 19.7
2024-05-08
Radiology
Abstract: Background: Generating radiologic findings from chest radiographs is pivotal in medical image analysis. The emergence of OpenAI's generative pretrained transformer, GPT-4 with vision (GPT-4V), has opened new perspectives on the potential for automated image-text pair generation. However, the application of GPT-4V to real-world chest radiography is yet to be thoroughly examined. Purpose: To investigate the capability of GPT-4V to generate radiologic findings from real-world chest radiographs....
radiology, nuclear medicine & medical imaging
What problem does this paper attempt to address?
This paper evaluates how well GPT-4V (a multimodal large language model with visual recognition ability) generates radiologic findings from chest X-rays. Specifically, the study examines the ability of GPT-4V to detect radiologic findings on real-world chest X-rays in zero-shot and few-shot settings, focusing on the detection of International Classification of Diseases, 10th Revision (ICD-10) codes and their corresponding anatomical locations (i.e., left or right side).

### Research Background

Generating radiologic findings from chest X-rays is crucial for medical image analysis. Recently, fine-tuned pretrained models have demonstrated the ability to convert image content into text. However, these models are usually trained on large, non-domain-specific datasets and may require further domain-specific adjustment before they can be applied to chest X-rays. The emergence of OpenAI's generative pretrained transformer with vision capabilities, GPT-4V, offers a new perspective on automatic image-text pair generation. Although previous studies have explored the performance of GPT-4 in generating radiologic impressions and summarizing clinical trials, the practical application of multimodal large language models to interpreting real-world chest X-rays has not been fully studied.

### Research Purpose

The purpose of this study is to evaluate the ability of GPT-4V to generate radiologic findings from real-world chest X-rays.

### Methods

The study uses a retrospective design. A total of 100 chest X-rays with their free-text radiology reports were collected and independently annotated by two attending radiologists and three resident physicians to establish a reference standard. The X-rays were drawn from the National Institutes of Health (NIH) Chest X-ray Dataset and the Medical Imaging and Data Resource Center (MIDRC).

### Results

In the zero-shot setting, when only ICD-10 codes are considered, GPT-4V achieves an average positive predictive value (PPV) of 12.3%, an average true positive rate (TPR) of 5.8%, and an average F1 score of 7.3% on the NIH dataset; on the MIDRC dataset, the average PPV is 25.0%, the average TPR is 16.8%, and the average F1 score is 18.2%. When both ICD-10 codes and their corresponding anatomical locations are considered, the average PPV on the NIH dataset is 7.8%, the average TPR is 3.5%, and the average F1 score is 4.5%; on the MIDRC dataset, the average PPV is 10.9%, the average TPR is 4.9%, and the average F1 score is 6.4%. With few-shot learning, GPT-4V's performance improved on both datasets, but the average PPV did not increase substantially.

### Conclusions

Although GPT-4V shows potential in understanding natural images, its effectiveness in interpreting real-world chest X-rays is limited. The results indicate that GPT-4V is not currently capable enough to be used in clinical practice for interpreting chest X-rays. Future research requires larger datasets and more comprehensive evaluations to further develop and improve the application of multimodal large language models in this field.
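
For readers unfamiliar with the metrics quoted in the results, the following is a minimal sketch of how PPV, TPR, and F1 can be computed for a single image and then averaged over a dataset. It is not the authors' evaluation code: the function names, the `(code, side)` tuple representation, the convention of scoring 0 when a denominator is empty, and the per-image averaging are all assumptions made for illustration, and the ICD-10 codes in the example are illustrative only.

```python
from typing import Iterable, Set, Tuple

# Hypothetical representation: one finding = (ICD-10 code, side).
Finding = Tuple[str, str]
Scores = Tuple[float, float, float]


def ppv_tpr_f1(predicted: Set, reference: Set) -> Scores:
    """Return (PPV, TPR, F1) for one image from predicted and reference sets."""
    tp = len(predicted & reference)  # findings the model reported correctly
    fp = len(predicted - reference)  # findings reported but not in the reference
    fn = len(reference - predicted)  # reference findings the model missed
    # Assumed convention: a metric with an empty denominator scores 0.
    ppv = tp / (tp + fp) if tp + fp else 0.0
    tpr = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * ppv * tpr / (ppv + tpr) if ppv + tpr else 0.0
    return ppv, tpr, f1


def codes_only(findings: Set[Finding]) -> Set[str]:
    """Drop the side so that only the ICD-10 code has to match."""
    return {code for code, _side in findings}


def macro_average(scores: Iterable[Scores]) -> Scores:
    """Average each metric over images (assumed averaging scheme)."""
    rows = list(scores)
    return tuple(sum(row[i] for row in rows) / len(rows) for i in range(3))


# Illustrative data for one image (not taken from the study).
predicted = {("J90", "left"), ("J18.9", "right")}
reference = {("J90", "right"), ("J98.11", "left")}

code_level = ppv_tpr_f1(codes_only(predicted), codes_only(reference))
code_and_side_level = ppv_tpr_f1(predicted, reference)
```

In this toy case the model names one correct code but places it on the wrong side, so the code-only scores are all 0.5 while the code-plus-location scores are all 0. This is the same direction of change as in the reported results, where requiring the location to match lowers every metric.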