Accuracy of large language models in answering ophthalmology board-style questions: A meta-analysis

Jo-Hsuan Wu,Takashi Nishida,T Y Alvin Liu
DOI: https://doi.org/10.1016/j.apjo.2024.100106
Abstract:Purpose: To evaluate the accuracy of large language models (LLMs) in answering ophthalmology board-style questions. Design: Meta-analysis. Methods: Literature search was conducted using PubMed and Embase in March 2024. We included full-length articles and research letters published in English that reported the accuracy of LLMs in answering ophthalmology board-style questions. Data on LLM performance, including the number of questions submitted and correct responses generated, were extracted for each question set from individual studies. Pooled accuracy was calculated using a random-effects model. Subgroup analyses were performed based on the LLMs used and specific ophthalmology topics assessed. Results: Among the 14 studies retrieved, 13 (93 %) tested LLMs on multiple ophthalmology topics. ChatGPT-3.5, ChatGPT-4, Bard, and Bing Chat were assessed in 12 (86 %), 11 (79 %), 4 (29 %), and 4 (29 %) studies, respectively. The overall pooled accuracy of LLMs was 0.65 (95 % CI: 0.61-0.69). Among the different LLMs, ChatGPT-4 achieved the highest pooled accuracy at 0.74 (95 % CI: 0.73-0.79), while ChatGPT-3.5 recorded the lowest at 0.52 (95 % CI: 0.51-0.54). LLMs performed best in "pathology" (0.78 [95 % CI: 0.70-0.86]) and worst in "fundamentals and principles of ophthalmology" (0.52 [95 % CI: 0.48-0.56]). Conclusions: The overall accuracy of LLMs in answering ophthalmology board-style questions was acceptable but not exceptional, with ChatGPT-4 and Bing Chat being top-performing models. Performance varied significantly based on specific ophthalmology topics tested. Inconsistent performances are of concern, highlighting the need for future studies to include ophthalmology board-style questions with images to more comprehensively examine the competency of LLMs.
What problem does this paper attempt to address?