Comparative Assessment of Otolaryngology Knowledge Among Large Language Models

Dante J. Merlino, Santiago R. Brufau, George Saieed, Kathryn M. Van Abel, Daniel L. Price, David J. Archibald, Gregory A. Ator, Matthew L. Carlson
DOI: https://doi.org/10.1002/lary.31781
IF: 2.97
2024-09-23
The Laryngoscope
Abstract: This study assessed the baseline knowledge of advanced large language models (GPT‐3.5 and GPT‐4 by OpenAI; PaLM2 and MedPaLM by Google; Llama3:70b by Meta) in otolaryngology–head and neck surgery, using a dataset of 4566 board‐style multiple‐choice questions. The highest‐performing model, GPT‐4, answered 77% of questions correctly, while the lowest‐performing model, PaLM2, answered 56.5% correctly; the free, open source model Llama3:70b answered 66.8% correctly. Performance improved across models when they were asked to provide the reasoning behind their responses, with GPT‐4 changing an incorrect answer to the correct one 31% of the time.
Objective: The purpose of this study was to evaluate the performance of advanced large language models from OpenAI (GPT‐3.5 and GPT‐4) and Google (PaLM2 and MedPaLM), and of an open source model from Meta (Llama3:70b), in answering clinical multiple‐choice test questions in the field of otolaryngology–head and neck surgery.
Methods: A dataset of 4566 otolaryngology questions was used; each model was provided a standardized prompt followed by a question. One hundred questions that were answered incorrectly by all models were further interrogated to gain insight into the causes of incorrect answers.
Results: GPT‐4 was the most accurate, correctly answering 3520 of 4566 questions (77.1%). MedPaLM correctly answered 3223 of 4566 (70.6%) questions, while Llama3:70b, GPT‐3.5, and PaLM2 were correct on 3052 of 4566 (66.8%), 2672 of 4566 (58.5%), and 2583 of 4566 (56.5%) questions, respectively. Three hundred and sixty‐nine questions were answered incorrectly by all models. Prompts to provide reasoning improved accuracy in all models: GPT‐4 changed from an incorrect to a correct answer 31% of the time, while GPT‐3.5, Llama3, PaLM2, and MedPaLM corrected their responses 25%, 18%, 19%, and 17% of the time, respectively.
Conclusion: Large language models vary in their understanding of otolaryngology‐specific clinical knowledge. OpenAI's GPT‐4 has a strong understanding of core concepts as well as detailed information in the field of otolaryngology. Its baseline understanding in this field makes it well‐suited to serve in roles related to head and neck surgery education, provided that the appropriate precautions are taken and potential limitations are understood.
Level of Evidence: N/A
Laryngoscope, 2024
medicine, research & experimental; otorhinolaryngology
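The accuracy figures in the Results can be recomputed from the reported counts. The following is a minimal sketch in Python and is not taken from the paper: the dictionary layout, the model labels used as keys, and the helper name `accuracy_percent` are illustrative assumptions, while the counts and the total of 4566 questions come from the abstract. The last decimal place of a recomputed percentage may differ slightly from the reported one depending on whether values are rounded or truncated.

```python
# Counts of correct answers as reported in the abstract (out of 4566 questions).
# The key names are illustrative labels, not identifiers from the paper.
TOTAL_QUESTIONS = 4566

correct_counts = {
    "GPT-4": 3520,
    "MedPaLM": 3223,
    "Llama3:70b": 3052,
    "GPT-3.5": 2672,
    "PaLM2": 2583,
}


def accuracy_percent(correct: int, total: int) -> float:
    """Return accuracy as a percentage (hypothetical helper for illustration)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


# Accuracy per model, then the models ordered from most to least accurate.
accuracies = {
    model: accuracy_percent(count, TOTAL_QUESTIONS)
    for model, count in correct_counts.items()
}
ranking = sorted(accuracies, key=accuracies.get, reverse=True)

# Share of questions that every model answered incorrectly (369 per the abstract).
missed_by_all_percent = accuracy_percent(369, TOTAL_QUESTIONS)

for model in ranking:
    print(f"{model}: {accuracies[model]:.2f}%")
print(f"Missed by all models: {missed_by_all_percent:.2f}%")
```

Using a single shared denominator makes the comparison across models direct, since every model was evaluated on the same question set.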