Abstract:

Importance: Using artificial intelligence (AI) to support clinical diagnosis has been an active research topic for more than six decades. However, few studies have had the scale and accuracy needed to translate into clinical practice. That may change with the power of large language models (LLMs). In this study, we evaluated the accuracy on medical licensing examination questions of the newly released Generative Pre-trained Transformer 4 with vision (GPT-4V), a large multimodal model trained to analyze image inputs together with text instructions from the user. This study is the first to evaluate GPT models for interpreting medical images.

Objective: To evaluate the performance of GPT-4V on medical licensing examination questions with images, and to analyze the interpretability of its answers.

Design, Setting, and Participants: We used 3 sets of multiple-choice questions with images to evaluate GPT-4V performance. The first set was drawn from the United States Medical Licensing Examination (USMLE) sample questions for Step 1, Step 2 CK, and Step 3 from the National Board of Medical Examiners (NBME). The second set was derived from AMBOSS, a commonly used question bank for medical students, which also provides statistics on question difficulty and on exam performance relative to its user base. The third set was the Diagnostic Radiology Qualifying Core Exam (DRQCE) from the American Board of Radiology. The study (including data analysis) was conducted from September to October 2023.

Main Outcomes and Measures: The answer accuracy of GPT-4V was compared with that of two other large language models, GPT-4 and ChatGPT. GPT-4V explanations were evaluated on 4 qualitative metrics: image misunderstanding, text hallucination, reasoning error, and non-medical error.

Results: On the 3 exams with images (NBME, AMBOSS, and DRQCE), GPT-4V achieved accuracies of 86.2%, 62.0%, and 73.1%, respectively. GPT-4V outperformed ChatGPT and GPT-4 by 131.8% and 64.5%, respectively, on average across the data sets. In the AMBOSS data set, the model's performance decreased as question difficulty increased. GPT-4V achieved an accuracy of 90.7% on the full USMLE exam, exceeding the passing threshold of about 60% accuracy. Among the incorrect answers, 75.9% of responses included a misinterpretation of the image. However, 39.0% of them could be easily solved with a short hint.

Conclusion: In this cross-sectional study, GPT-4V achieved high accuracy on the USMLE, in the 70th to 80th percentile relative to AMBOSS users preparing for the exam. The results suggest the potential of GPT-4V for clinical decision support. However, the explanations generated by GPT-4V revealed several issues, and explanation quality needs to improve before potential use in clinical decision support.
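The two kinds of figures reported in the Results (per-exam accuracy and the margin over a baseline model) can be illustrated with a minimal Python sketch. It assumes that the margins of 131.8% and 64.5% are relative gains over the baseline accuracy, which the abstract does not state explicitly. The function names, the answer key, and both sets of model choices are hypothetical and are not taken from the study.

```python
def accuracy(predictions, answers):
    """Fraction of questions where the chosen option matches the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def relative_improvement(new_acc, baseline_acc):
    """Relative gain of new_acc over baseline_acc, in percent.

    Assumption: this is how an "outperformed by X%" figure is read here.
    """
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (new_acc - baseline_acc) / baseline_acc * 100.0


# Toy data invented for illustration only (not results from the study).
answer_key = ["A", "C", "B", "D", "A", "E", "B", "C", "D", "A"]
baseline_choices = ["A", "B", "B", "A", "C", "E", "D", "C", "B", "E"]
multimodal_choices = ["A", "C", "B", "D", "A", "E", "B", "A", "D", "C"]

baseline_acc = accuracy(baseline_choices, answer_key)
multimodal_acc = accuracy(multimodal_choices, answer_key)
gain = relative_improvement(multimodal_acc, baseline_acc)

print(f"baseline accuracy:   {baseline_acc:.1%}")
print(f"multimodal accuracy: {multimodal_acc:.1%}")
print(f"relative gain:       {gain:.1f}%")
```

Under this reading, a relative gain above 100% means the newer model answered more than twice as many questions correctly as the baseline, which is why such a margin can exceed 100% even though accuracy itself cannot.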

Diagnostic Accuracy of GPT Multimodal Analysis on USMLE Questions Including Text and Visuals

Capability of GPT-4V(ision) in Japanese National Medical Licensing Examination

Performance of Multimodal GPT-4V on USMLE with Image: Potential for Imaging Diagnostic Support with Explanations

Evaluating General Vision-Language Models for Clinical Medicine

Visual-Textual Integration in LLMs for Medical Diagnosis: A Quantitative Analysis

Multimodal ChatGPT for Medical Applications: an Experimental Study of GPT-4V

Evaluating LLM -- Generated Multimodal Diagnosis from Medical Images and Symptom Analysis

Evaluating the Performance of ChatGPT-4o Vision Capabilities on Image-Based USMLE Step 1, Step 2, and Step 3 Examination Questions

Critical Analysis of ChatGPT 4 Omni in USMLE Disciplines, Clinical Clerkships, and Clinical Skills

Comparative Analysis of GPT-4Vision, GPT-4 and Open Source LLMs in Clinical Diagnostic Accuracy: A Benchmark Against Human Expertise

Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine

Step into the era of large multimodal models: a pilot study on ChatGPT-4V(ision)'s ability to interpret radiological images

Assessing GPT-4 Multimodal Performance in Radiological Image Analysis

Capabilities of GPT-4 on Medical Challenge Problems

How Large Language Models Perform on the United States Medical Licensing Examination: A Systematic Review

Large language models (LLMs) in radiology exams for medical students: Performance and consequences

Can GPT-4V(ision) Serve Medical Applications? Case Studies on GPT-4V for Multimodal Medical Diagnosis

Capability of GPT-4V(ision) in the Japanese National Medical Licensing Examination: Evaluation Study

A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis

How Well Does ChatGPT Do When Taking the Medical Licensing Exams? The Implications of Large Language Models for Medical Education and Knowledge Assessment