What problem does this paper attempt to address?

This paper evaluates how well GPT-4V (a multimodal large language model with visual recognition ability) generates radiological findings from chest X-rays. Specifically, the study examines GPT-4V's ability to detect radiological findings on real-world chest X-rays in zero-shot and few-shot settings, focusing on the detection of International Classification of Diseases, 10th Revision (ICD-10) codes and their corresponding anatomical locations (i.e., left and right sides).

### Research Background

Generating radiological findings from chest X-rays is crucial for medical image analysis. Recently, fine-tuned pre-trained models have demonstrated the ability to convert image content into text. However, these models are usually trained on large datasets that are not domain-specific and may require further domain-specific adjustment before they can be applied to chest X-rays. The emergence of OpenAI's Generative Pretrained Transformer GPT-4V (with visual capabilities) provides a new perspective for automatic image-text pair generation. Although previous studies have explored the performance of GPT-4 in generating radiological impressions and summarizing clinical trials, the practical application of multimodal large language models to interpreting real-world chest X-rays has not been fully studied.

### Research Purpose

The purpose of this study is to evaluate the ability of GPT-4V to generate radiological findings from real-world chest X-rays.

### Methods

The study adopts a retrospective design. A total of 100 chest X-rays and their free-text radiological reports were collected and independently annotated by two attending radiologists and three resident physicians to establish a reference standard. The X-rays were drawn from two sources: the National Institutes of Health (NIH) Chest X-ray Dataset and the Medical Imaging and Data Resource Center (MIDRC).

### Results

In the zero-shot setting, GPT-4V achieved an average positive predictive value (PPV) of 12.3%, an average true positive rate (TPR) of 5.8%, and an average F1-score of 7.3% on the NIH dataset; on the MIDRC dataset, the average PPV was 25.0%, the average TPR was 16.8%, and the average F1-score was 18.2%. When both the ICD-10 code and its corresponding anatomical location had to be correct, the average PPV on the NIH dataset was 7.8%, the average TPR was 3.5%, and the average F1-score was 4.5%; on the MIDRC dataset, the average PPV was 10.9%, the average TPR was 4.9%, and the average F1-score was 6.4%. In the few-shot setting, GPT-4V's performance improved on both datasets, but the average PPV did not increase significantly.

### Conclusions

Although GPT-4V shows potential in understanding natural images, its effectiveness in interpreting real-world chest X-rays is limited. The results indicate that GPT-4V is not currently suitable for interpreting chest X-rays in clinical practice. Future research requires larger datasets and more comprehensive evaluations to further develop and improve the application of multimodal large language models in this field.
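For readers unfamiliar with the metrics reported in the results, the following is a minimal sketch of how PPV, TPR, and F1 could be computed for a single image, assuming the model's output and the reference standard are each reduced to a set of findings. The function name `ppv_tpr_f1`, the set-based matching, and the example ICD-10 codes are illustrative assumptions; the summary does not describe the paper's exact scoring or averaging procedure.

```python
from typing import Hashable, Iterable, Tuple


def ppv_tpr_f1(
    predicted: Iterable[Hashable], reference: Iterable[Hashable]
) -> Tuple[float, float, float]:
    """Return (PPV, TPR, F1) for one image.

    Hypothetical helper: findings are compared by exact match, so a finding
    may be a bare ICD-10 code ("J90") or a (code, side) pair when laterality
    must also be correct.
    """
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)

    # PPV (precision): share of predicted findings that are in the reference.
    ppv = true_positives / len(pred) if pred else 0.0
    # TPR (recall): share of reference findings that the model predicted.
    tpr = true_positives / len(ref) if ref else 0.0
    # F1: harmonic mean of PPV and TPR, defined as 0 when both are 0.
    f1 = 2 * ppv * tpr / (ppv + tpr) if (ppv + tpr) > 0 else 0.0
    return ppv, tpr, f1


# Illustrative codes only, not data from the study.
ppv, tpr, f1 = ppv_tpr_f1(["J90", "J18.9"], ["J90", "I51.7", "J98.11"])
print(f"PPV={ppv:.2f} TPR={tpr:.2f} F1={f1:.2f}")

# With laterality required, the right code on the wrong side does not count.
print(ppv_tpr_f1([("J90", "left")], [("J90", "right")]))
```

Under this sketch, a dataset-level average would be the mean of the per-image values. That is consistent with the reported average F1-scores not being the harmonic mean of the reported average PPV and TPR, though it remains an assumption about the paper's method.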

Evaluating GPT-V4 (GPT-4 with Vision) on Detection of Radiologic Findings on Chest Radiographs

Evaluating GPT-4 with Vision on Detection of Radiological Findings on Chest Radiographs

From Text to Image: Exploring GPT-4Vision's Potential in Advanced Radiological Analysis across Subspecialties

GPT-4 Vision: Multi-Modal Evolution of ChatGPT and Potential Role in Radiology

Performance of GPT-4 with Vision on Text- and Image-based ACR Diagnostic Radiology In-Training Examination Questions

GPT-4V Cannot Generate Radiology Reports Yet

Assessing GPT-4 Multimodal Performance in Radiological Image Analysis

A Comprehensive Study of GPT-4V's Multimodal Capabilities in Medical Imaging

Can GPT-4V(ision) Serve Medical Applications? Case Studies on GPT-4V for Multimodal Medical Diagnosis

Evaluation of GPT-4 for chest X-ray impression generation: A reader study on performance and perception

Exploring the Boundaries of GPT-4 in Radiology

A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis

Revolution or risk?—Assessing the potential and challenges of GPT-4V in radiologic image interpretation

Toward Foundation Models in Radiology? Quantitative Assessment of GPT-4V's Multimodal and Multianatomic Region Capabilities

Evaluating GPT-4o's Performance in the Official European Board of Radiology Exam: A Comprehensive Assessment

Generative pretrained transformer-4, an artificial intelligence text predictive model, has a high capability for passing novel written radiology exam questions

Evaluation of GPT Large Language Model Performance on RSNA 2023 Case of the Day Questions

Advancing radiology with GPT-4: Innovations in clinical applications, patient engagement, research, and learning

3D-CT-GPT: Generating 3D Radiology Reports through Integration of Large Vision-Language Models

Comparative analysis of GPT-4-based ChatGPT's diagnostic performance with radiologists using real-world radiology reports of brain tumors