Performance Assessment of an Artificial Intelligence Chatbot in Clinical Vitreoretinal Scenarios

Michael J. Maywood, Ravi Parikh, Avnish Deobhakta, Tedi Begaj
DOI: https://doi.org/10.1097/iae.0000000000004053
2024-01-23
Retina
Abstract: Purpose: To determine how often ChatGPT provides accurate and comprehensive information on clinical vitreoretinal scenarios, to assess the types of sources it primarily uses, and to determine whether those sources are hallucinated. Methods: In this retrospective cross-sectional study, we designed 40 open-ended clinical scenarios across 4 main topics in vitreoretinal disease. Responses were graded on correctness and comprehensiveness by two blinded retina specialists. The primary outcome was the number of clinical scenarios that ChatGPT answered correctly and comprehensively. Secondary outcomes included theoretical harm to patients, the distribution of the types of references used by the chatbot, and the frequency of hallucinated references. Results: In June 2023, ChatGPT answered 83% (33/40) of clinical scenarios correctly but provided a comprehensive answer in only 52.5% (21/40) of cases. Subgroup analysis demonstrated an average correct score of 86.7% in neovascular age-related macular degeneration (nAMD), 100% in diabetic retinopathy (DR), 76.7% in retinal vascular disease, and 70% in the surgical domain. There were 6 incorrect responses: 1 (16.7%) case of no harm, 3 (50%) cases of possible harm, and 2 (33.3%) cases of definitive harm. Conclusion: ChatGPT correctly answered more than 80% of complex open-ended vitreoretinal clinical scenarios but was less able to provide a comprehensive response.
ophthalmology
What problem does this paper attempt to address?