The Comparative Diagnostic Capability of Large Language Models in Otolaryngology

Akshay Warrier, Rohan Singh, Afash Haleem, Haider Zaki, Jean Anderson Eloy
DOI: https://doi.org/10.1002/lary.31434
IF: 2.97
2024-04-03
The Laryngoscope
Abstract: Objectives: Evaluate and compare the ability of large language models (LLMs) to diagnose various ailments in otolaryngology. Methods: We collected all 100 clinical vignettes from the second edition of Otolaryngology Cases—The University of Cincinnati Clinical Portfolio by Pensak et al. With the addition of the prompt "Provide a diagnosis given the following history," we prompted ChatGPT‐3.5, Google Bard, and Bing‐GPT4 to provide a diagnosis for each vignette. These diagnoses were compared to the portfolio for accuracy and recorded. All queries were run in June 2023. Results: ChatGPT‐3.5 was the most accurate model (89% success rate), followed by Google Bard (82%) and Bing‐GPT4 (74%). A chi‐squared test revealed a significant difference between the three LLMs in providing correct diagnoses (p = 0.023). Of the 100 vignettes, seven required additional testing results (e.g., biopsy, non‐contrast CT) for accurate clinical diagnosis. When these vignettes were omitted, the revised success rates were 95.7% for ChatGPT‐3.5, 88.17% for Google Bard, and 78.72% for Bing‐GPT4 (p = 0.002). Conclusions: ChatGPT‐3.5 offers the most accurate diagnoses when given established clinical vignettes, as compared to Google Bard and Bing‐GPT4. LLMs may accurately offer assessments for common otolaryngology conditions but currently require detailed prompt information and critical supervision from clinicians. There is vast potential in the clinical applicability of LLMs; however, practitioners should be wary of possible "hallucinations" and misinformation in responses. Level of Evidence: 3. Laryngoscope, 2024.
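The first chi-squared result can be checked from the counts given in the abstract. The sketch below assumes the test was a Pearson chi-squared test of independence on a 3 × 2 table (model by correct/incorrect) without continuity correction; the abstract does not state which variant was used. The function name `chi_squared_three_groups` is illustrative only. With three groups and two outcomes there are 2 degrees of freedom, for which the chi-squared tail probability has the closed form exp(-x/2), so no statistics library is needed.

```python
import math

# Correct diagnoses out of 100 vignettes, as reported in the abstract.
correct = {"ChatGPT-3.5": 89, "Google Bard": 82, "Bing-GPT4": 74}
n_vignettes = 100


def chi_squared_three_groups(correct_counts, n):
    """Pearson chi-squared test on a 3 x 2 (correct / incorrect) table.

    Assumption: every group answered the same n items and no continuity
    correction is applied. Returns (statistic, p_value).
    """
    if len(correct_counts) != 3:
        raise ValueError("closed-form p-value below is only valid for 3 groups")

    observed = [(c, n - c) for c in correct_counts]
    grand_total = n * len(observed)
    column_totals = [sum(row[j] for row in observed) for j in range(2)]

    statistic = 0.0
    for row in observed:
        for j in range(2):
            # Expected count = row total * column total / grand total.
            expected = n * column_totals[j] / grand_total
            statistic += (row[j] - expected) ** 2 / expected

    # For 2 degrees of freedom, P(X > x) = exp(-x / 2).
    p_value = math.exp(-statistic / 2)
    return statistic, p_value


stat, p_value = chi_squared_three_groups(list(correct.values()), n_vignettes)
print(f"chi-squared = {stat:.3f}, df = 2, p = {p_value:.3f}")
```

Under these assumptions the statistic is about 7.53 and p is about 0.023, which agrees with the reported value. The second p-value (0.002) cannot be checked the same way, because the abstract gives only percentages for the reduced vignette set and not the underlying counts.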
medicine, research & experimental; otorhinolaryngology