Abstract

Objectives: This study aims to assess the performance of GPT-4V, a multimodal artificial intelligence (AI) model capable of analyzing both images and text, in interpreting radiological images. It covers a range of modalities, anatomical regions, and pathologies to explore the potential of zero-shot generative AI in enhancing diagnostic processes in radiology.

Methods: We analyzed 230 anonymized emergency room diagnostic images, consecutively collected over 1 week, using GPT-4V. Modalities included ultrasound (US), computerized tomography (CT), and X-ray images. The interpretations provided by GPT-4V were then compared with those of senior radiologists to evaluate the accuracy of GPT-4V in recognizing the imaging modality, anatomical region, and pathology present in the images.

Results: GPT-4V identified the imaging modality correctly in 100% of cases (221/221), the anatomical region in 87.1% (189/217), and the pathology in 35.2% (76/216). However, the model's performance varied significantly across modalities. Anatomical region identification accuracy ranged from 60.9% (39/64) in US images to 97.0% (98/101) in CT and 100% (52/52) in X-ray images (p < 0.001). Similarly, pathology identification ranged from 9.1% (6/66) in US images to 36.4% (36/99) in CT and 66.7% (34/51) in X-ray images (p < 0.001). These variations indicate inconsistencies in GPT-4V's ability to interpret radiological images accurately.

Conclusion: While the integration of AI in radiology, exemplified by multimodal GPT-4, offers promising avenues for diagnostic enhancement, the current capabilities of GPT-4V are not yet reliable for interpreting radiological images. This study underscores the necessity for ongoing development to achieve dependable performance in radiology diagnostics.
Clinical relevance statement: Although GPT-4V shows promise in radiological image interpretation, its high diagnostic hallucination rate (> 40%) indicates it cannot be trusted for clinical use as a standalone tool. Improvements are necessary to enhance its reliability and ensure patient safety.

Key points: GPT-4V's capability in analyzing images offers new clinical possibilities in radiology. GPT-4V excels in identifying imaging modalities but demonstrates inconsistent anatomy and pathology detection. Ongoing AI advancements are necessary to enhance diagnostic reliability in radiological applications.
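
The per-modality figures in the Results can be re-derived from the reported counts. The sketch below is illustrative only: the abstract does not name the statistical test behind the p-values, so a Pearson chi-square test on a 2 x 3 table (correct/incorrect by modality) is assumed here, and the helper names (accuracy_pct, chi_square_2xk, p_value_dof2) are hypothetical. It uses only the Python standard library, relying on the fact that the chi-square survival function for 2 degrees of freedom is exp(-x/2).

```python
import math

# (correct, total) per modality, copied from the Results section of the abstract.
anatomy = {"US": (39, 64), "CT": (98, 101), "X-ray": (52, 52)}
pathology = {"US": (6, 66), "CT": (36, 99), "X-ray": (34, 51)}


def accuracy_pct(correct, total):
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100 * correct / total, 1)


def chi_square_2xk(counts):
    """Pearson chi-square statistic and degrees of freedom for a 2 x k
    table of correct/incorrect counts (assumed test, not confirmed by the source)."""
    total_correct = sum(c for c, _ in counts.values())
    total_n = sum(n for _, n in counts.values())
    pooled = total_correct / total_n  # pooled accuracy under the null hypothesis
    stat = 0.0
    for correct, n in counts.values():
        cells = (
            (correct, n * pooled),            # observed vs. expected correct
            (n - correct, n * (1 - pooled)),  # observed vs. expected incorrect
        )
        for observed, expected in cells:
            stat += (observed - expected) ** 2 / expected
    return stat, len(counts) - 1


def p_value_dof2(stat):
    """Chi-square survival function, valid only for 2 degrees of freedom."""
    return math.exp(-stat / 2)


for name, counts in (("anatomical region", anatomy), ("pathology", pathology)):
    stat, dof = chi_square_2xk(counts)
    rates = {m: accuracy_pct(c, n) for m, (c, n) in counts.items()}
    print(name, rates, f"chi2={stat:.1f}", f"dof={dof}", f"p={p_value_dof2(stat):.2e}")
```

Under these assumptions, both comparisons fall well below the 0.001 threshold, consistent with the reported p < 0.001. A different test, such as Fisher's exact test, would give different exact values.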

Evaluating multimodal AI in medical diagnostics

Assessing GPT-4 multimodal performance in radiological image analysis

Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine

Evaluating the Diagnostic Performance of Large Language Models in Identifying Complex Multisystemic Syndromes: A Comparative Study with Radiology Residents

Comparative Analysis of GPT-4Vision, GPT-4 and Open Source LLMs in Clinical Diagnostic Accuracy: A Benchmark Against Human Expertise

Can GPT-4V(ision) Serve Medical Applications? Case Studies on GPT-4V for Multimodal Medical Diagnosis

Multimodal Artificial Intelligence in Medicine

Human-AI collectives produce the most accurate differential diagnoses

Multimodal Foundation Models Exploit Text to Make Medical Image Predictions

The virtual reference radiologist: comprehensive AI assistance for clinical image reading and interpretation

Comparative Evaluation of AI Models Such as ChatGPT 3.5, ChatGPT 4.0, and Google Gemini in Neuroradiology Diagnostics

Embracing the future—is artificial intelligence already better? A comparative study of artificial intelligence performance in diagnostic accuracy and decision‐making

Rams, hounds and white boxes: Investigating human-AI collaboration protocols in medical diagnosis

MediConfusion: Can you trust your AI radiologist? Probing the reliability of multimodal medical foundation models

As artificial intelligence goes multimodal, medical applications multiply.

The future of multimodal artificial intelligence models for integrating imaging and clinical metadata: a narrative review

Vision–language AI assistance in human pathology

Evaluating LLM -- Generated Multimodal Diagnosis from Medical Images and Symptom Analysis

Visual-Textual Integration in LLMs for Medical Diagnosis: A Quantitative Analysis

Assessing the Capability of ChatGPT, Google Bard, and Microsoft Bing in Solving Radiology Case Vignettes