Abstract:Importance Using artificial intelligence (AI) to help clinical diagnoses has been an active research topic for more than six decades. Few research however has the scale and accuracy that can be turned into clinical practice. The tide may be turned today with the power of large language models (LLMs). In this application, we evaluated the accuracy of medical license exam using the newly released Generative Pre-trained Transformer 4 with vision (GPT-4V), a large multimodal model trained to analyze image inputs with the text instructions from the user. This study is the first to evaluate GPTs for interpreting medical images. Objective This study aimed to evaluate the performance of GPT-4V on medical licensing examination questions with images, as well as to analyze interpretability. Design, Setting, and Participants We used 3 sets of multiple-choice questions with images to evaluate GPT-4V performance. The first set was the United States Medical Licensing Examination (USMLE) from the National Board of Medical Examiners (NBME) sample questions in step1, step2CK, and step3. The second set was derived from AMBOSS, a commonly used question bank for medical students, which also provides statistics on question difficulty and the performance on an exam relative to the user base. The third set was the Diagnostic Radiology Qualifying Core Exam (DRQCE) from the American Board of Radiology. The study (including data analysis) was conducted from September to October 2023. Main Outcomes and Measures The choice accuracy of GPT-4V was compared to two other large language models, GPT-4 and ChatGPT. The GPT-4V explanation was evaluated across 4 qualitative metrics: image misunderstanding, text hallucination, reasoning error, and non-medical error. Results Of the 3 exams with images, NBME, AMBOSS, and DRQCE, GPT-4V achieved accuracies of 86.2%, 62.0%, and 73.1%, respectively. GPT-4V outperformed ChatGPT and GPT-4 by 131.8% and 64.5% on average across various data sets. The model demonstrated a decreasing trend in performance as question difficulty increased in the AMBOSS dataset. GPT-4V achieves an accuracy of 90.7% in the full USMLE exam, outperforming the passing threshold of about 60% accuracy. Among the incorrect answers, 75.9% of responses included misinterpretation of the image. However, 39.0% of them could be easily solved with a short hint. Conclusion In this cross-sectional study, GPT-4V achieved a high accuracy of USMLE that was in the 70th - 80th percentile with AMBOSS users preparing for the exam. The results suggest the potential of GPT-4V for clinical decision support. However, GPT-4V generated explanation revealed several issues. It needs to improve explanation quality for potential use in clinical decision support.

GPT-4V(ision) Unsuitable for Clinical Care and Education: A Clinician-Evaluated Assessment

Can GPT-4V(ision) Serve Medical Applications? Case Studies on GPT-4V for Multimodal Medical Diagnosis

Capability of GPT-4V(ision) in Japanese National Medical Licensing Examination

Comparative Analysis of GPT-4Vision, GPT-4 and Open Source LLMs in Clinical Diagnostic Accuracy: A Benchmark Against Human Expertise

Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine

GPT-4 Vision: Multi-Modal Evolution of ChatGPT and Potential Role in Radiology

Evaluating General Vision-Language Models for Clinical Medicine

Assessing GPT-4 multimodal performance in radiological image analysis

Performance of Multimodal GPT-4V on USMLE with Image: Potential for Imaging Diagnostic Support with Explanations

Unveiling the Clinical Incapabilities: A Benchmarking Study of GPT-4V(ision) for Ophthalmic Multimodal Image Analysis

Multimodal ChatGPT for Medical Applications: an Experimental Study of GPT-4V

A Comprehensive Study of GPT-4V's Multimodal Capabilities in Medical Imaging

Holistic Evaluation of GPT-4V for Biomedical Imaging

Keeping Up With ChatGPT

A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis

Evaluating GPT-4 with Vision on Detection of Radiological Findings on Chest Radiographs

Revolution or risk?—Assessing the potential and challenges of GPT-4V in radiologic image interpretation

The virtual reference radiologist: comprehensive AI assistance for clinical image reading and interpretation

Capabilities of GPT-4 on Medical Challenge Problems