Diagnostic Accuracy of GPT Multimodal Analysis on USMLE Questions Including Text and Visuals

Vera Sorin, Benjamin S. Glicksberg, Yiftach Barash, Eli Konen, Girish Nadkarni, Eyal Klang
DOI: https://doi.org/10.1101/2023.10.29.23297733
2023-11-01
medRxiv
Abstract:
Objective: Large language models (LLMs) have demonstrated proficiency in free-text analysis in healthcare. With recent advancements, GPT-4 can now analyze both text and accompanying images. This study aimed to evaluate the performance of multimodal GPT-4 in analyzing medical images, using USMLE questions that incorporate visuals.
Methods: We analyzed GPT-4's performance on 55 USMLE sample questions across the three steps. In separate chat instances, we provided the model with each question both with and without its images, and calculated accuracy under each condition.
Results: GPT-4 achieved an accuracy of 80.0% with images and 65.0% without. In no case did the model answer correctly without images but incorrectly with them. Performance varied across USMLE steps and was significantly better for questions with figures than for those with graphs.
Conclusion: GPT-4 demonstrated an ability to analyze medical images from USMLE questions, including graphs and figures. A multimodal LLM in healthcare could potentially accelerate both patient care and research by integrating visual data and text in analysis processes.