Assessment of the clinical knowledge of ChatGPT-4 in neonatal-perinatal medicine: a comparative analysis with ChatGPT-3.5

Puneet Sharma, Guangze Luo, Cindy Wang, Dara Brodsky, Camilia R. Martin, Andrew Beam, Kristyn Beam
DOI: https://doi.org/10.1038/s41372-024-01912-8
2024-02-25
Journal of Perinatology
Abstract: Large language models (LLMs) have demonstrated promising performance on clinical knowledge tasks, including the United States Medical Licensing Examination and subspecialty board examinations [1,2,3,4]. Previous work from our group assessed the performance of ChatGPT-3.5 on practice questions for the neonatal-perinatal medicine board examination and found that it scored below a passing rate [3]. Recent results, however, have shown that a newer version, GPT-4, performs substantially better than GPT-3.5 on board questions in other medical fields [1]. We therefore conducted a comparative analysis of the performance of GPT-4 and GPT-3.5 on practice questions for the neonatal-perinatal medicine board examination. We compiled questions from a neonatal-perinatal medicine board examination preparation book and excluded questions that were not in multiple-choice format or that included figures, because GPT-3.5 does not support visual inputs [5]. This yielded 926 questions, which was sufficient to detect an effect size of 20% on a paired t-test (α = 0.05, β = 0.2). Each eligible question was entered into the application programming interface (API) using the same prompt for both versions. We told the LLM that it was the neonatal expert in the specific domain of the question and asked it to "take a deep breath and solve the following question." We used standard settings on both versions, including a temperature of zero.
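The sample-size statement in the abstract can be sanity-checked with a short calculation. The sketch below is not the authors' method: it assumes that "an effect size of 20%" means a standardized effect size of d = 0.2 on the paired differences, and it uses the normal approximation to the paired t-test rather than an exact noncentral-t computation. The function name `paired_sample_size` is hypothetical.

```python
import math
from statistics import NormalDist


def paired_sample_size(effect_size: float, alpha: float = 0.05, beta: float = 0.2) -> int:
    """Approximate number of pairs needed for a two-sided paired test.

    Assumption: effect_size is the standardized mean difference
    (mean of the paired differences divided by their standard deviation).
    Uses the normal approximation n = ((z_{1-alpha/2} + z_{1-beta}) / d)^2.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(1 - beta)        # quantile for power = 1 - beta
    return math.ceil(((z_alpha + z_beta) / effect_size) ** 2)


# Values taken from the abstract: alpha = 0.05, beta = 0.2, 926 questions.
required_pairs = paired_sample_size(0.2, alpha=0.05, beta=0.2)
is_sufficient = required_pairs <= 926
print(required_pairs, is_sufficient)
```

Under these assumptions the approximation asks for roughly 200 paired questions, so 926 questions would comfortably exceed the requirement. An exact t-based calculation would give a slightly larger number, and a different reading of "20%" (for example, a 20 percentage-point difference in accuracy) would need a different formula.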
pediatrics, obstetrics & gynecology