Abstract:Importance Using artificial intelligence (AI) to help clinical diagnoses has been an active research topic for more than six decades. Few research however has the scale and accuracy that can be turned into clinical practice. The tide may be turned today with the power of large language models (LLMs). In this application, we evaluated the accuracy of medical license exam using the newly released Generative Pre-trained Transformer 4 with vision (GPT-4V), a large multimodal model trained to analyze image inputs with the text instructions from the user. This study is the first to evaluate GPTs for interpreting medical images. Objective This study aimed to evaluate the performance of GPT-4V on medical licensing examination questions with images, as well as to analyze interpretability. Design, Setting, and Participants We used 3 sets of multiple-choice questions with images to evaluate GPT-4V performance. The first set was the United States Medical Licensing Examination (USMLE) from the National Board of Medical Examiners (NBME) sample questions in step1, step2CK, and step3. The second set was derived from AMBOSS, a commonly used question bank for medical students, which also provides statistics on question difficulty and the performance on an exam relative to the user base. The third set was the Diagnostic Radiology Qualifying Core Exam (DRQCE) from the American Board of Radiology. The study (including data analysis) was conducted from September to October 2023. Main Outcomes and Measures The choice accuracy of GPT-4V was compared to two other large language models, GPT-4 and ChatGPT. The GPT-4V explanation was evaluated across 4 qualitative metrics: image misunderstanding, text hallucination, reasoning error, and non-medical error. Results Of the 3 exams with images, NBME, AMBOSS, and DRQCE, GPT-4V achieved accuracies of 86.2%, 62.0%, and 73.1%, respectively. GPT-4V outperformed ChatGPT and GPT-4 by 131.8% and 64.5% on average across various data sets. The model demonstrated a decreasing trend in performance as question difficulty increased in the AMBOSS dataset. GPT-4V achieves an accuracy of 90.7% in the full USMLE exam, outperforming the passing threshold of about 60% accuracy. Among the incorrect answers, 75.9% of responses included misinterpretation of the image. However, 39.0% of them could be easily solved with a short hint. Conclusion In this cross-sectional study, GPT-4V achieved a high accuracy of USMLE that was in the 70th - 80th percentile with AMBOSS users preparing for the exam. The results suggest the potential of GPT-4V for clinical decision support. However, GPT-4V generated explanation revealed several issues. It needs to improve explanation quality for potential use in clinical decision support.

Assessing GPT-4 multimodal performance in radiological image analysis

Assessing GPT-4 Multimodal Performance in Radiological Image Analysis

The virtual reference radiologist: comprehensive AI assistance for clinical image reading and interpretation

GPT-4 Vision: Multi-Modal Evolution of ChatGPT and Potential Role in Radiology

Can GPT-4V(ision) Serve Medical Applications? Case Studies on GPT-4V for Multimodal Medical Diagnosis

A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis

A Comprehensive Study of GPT-4V's Multimodal Capabilities in Medical Imaging

Holistic Evaluation of GPT-4V for Biomedical Imaging

GPT-4V(ision) Unsuitable for Clinical Care and Education: A Clinician-Evaluated Assessment

Comparative Analysis of GPT-4Vision, GPT-4 and Open Source LLMs in Clinical Diagnostic Accuracy: A Benchmark Against Human Expertise

Evaluation of GPT-4 for chest X-ray impression generation: A reader study on performance and perception

Evaluation of GPT Large Language Model Performance on RSNA 2023 Case of the Day Questions

Advancing radiology with GPT-4: Innovations in clinical applications, patient engagement, research, and learning

Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine

Toward Foundation Models in Radiology? Quantitative Assessment of GPT-4V's Multimodal and Multianatomic Region Capabilities

Performance of Multimodal GPT-4V on USMLE with Image: Potential for Imaging Diagnostic Support with Explanations

Generative pre-trained transformer (GPT)-4 support for differential diagnosis in neuroradiology

Evaluating the Diagnostic Performance of Large Language Models in Identifying Complex Multisystemic Syndromes: A Comparative Study with Radiology Residents

Evaluation of GPT-4 ability to identify and generate patient instructions for actionable incidental radiology findings

GPT-4V Cannot Generate Radiology Reports Yet

Step into the era of large multimodal models: a pilot study on ChatGPT-4V(ision)'s ability to interpret radiological images