Comparing the Performance of ChatGPT and GPT-4 versus a Cohort of Medical Students on an Official University of Toronto Undergraduate Medical Education Progress Test

Christopher Meaney, Ryan S. Huang, Kevin Lu, Adam W. Fischer, Fok-Han Leung, Kulamakan Kulasegaram, Katina Tzanetos, Angela Punnett
DOI: https://doi.org/10.1101/2023.09.14.23295571
2023-09-15
MedRxiv
Abstract: Background: Large language model (LLM) based chatbots have recently received broad social uptake, demonstrating remarkable abilities in natural language understanding, natural language generation, dialogue, and logic/reasoning. Objective: To compare the performance of two LLM-based chatbots, ChatGPT and GPT-4, with that of a cohort of medical students on a University of Toronto undergraduate medical education (UME) progress test. Methods: We report the mean number of correct responses, stratified by year of training/education, for each cohort of undergraduate medical students. We report counts/percentages of correctly answered test questions for each of ChatGPT and GPT-4. We compare the performance of ChatGPT versus GPT-4 using McNemar's test for dependent proportions. We assess whether the percentage of correctly answered test questions for ChatGPT or GPT-4 falls within or outside the confidence intervals for the mean number of correct responses for each cohort of undergraduate medical students. Results: A total of N=1057 University of Toronto undergraduate medical students completed the progress test during the Fall-2022 and Winter-2023 semesters. Student performance improved with increased training/education levels: UME-Year1 mean=36.3%; UME-Year2 mean=44.1%; UME-Year3 mean=52.2%; UME-Year4 mean=58.5%. ChatGPT answered 68/100 (68.0%) questions correctly, whereas GPT-4 answered 79/100 (79.0%) questions correctly. GPT-4 performed statistically significantly better than ChatGPT (P=0.034). GPT-4 performed at a level equivalent to the top performing undergraduate medical student (79/100 questions correctly answered). Conclusions: This study adds to a growing body of literature demonstrating the remarkable performance of LLM-based chatbots on medical tests. GPT-4 performed at a level comparable to the best performing undergraduate medical student who attempted the progress test in 2022/2023. Future work will investigate the potential application of LLM-chatbots as tools for assisting learners/educators in medical education.
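As an illustration of the paired comparison named in the Methods, the following is a minimal Python sketch of McNemar's test for dependent proportions. It is not the authors' code. The abstract reports only the marginal totals (68/100 and 79/100), not the per-question discordant counts, so the values of `b` and `c` in the usage lines are hypothetical placeholders chosen only to satisfy b - c = 11. Whether the authors used the exact or the chi-square form of the test is likewise not stated, so both are sketched; the function names are illustrative.

```python
from math import comb, erfc, sqrt


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: questions the second model answered correctly and the first did not.
    c: questions the first model answered correctly and the second did not.
    Under the null hypothesis each discordant pair is a fair coin flip,
    so min(b, c) follows a Binomial(b + c, 0.5) distribution.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    lower_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * lower_tail)


def mcnemar_chi2(b: int, c: int) -> tuple[float, float]:
    """Asymptotic McNemar test without continuity correction.

    Returns (statistic, p_value). For a chi-square variable with one
    degree of freedom, P(X > x) equals erfc(sqrt(x / 2)).
    """
    n = b + c
    if n == 0:
        return 0.0, 1.0
    stat = (b - c) ** 2 / n
    return stat, erfc(sqrt(stat / 2))


# HYPOTHETICAL discordant counts: the source does not report them.
# They only need to satisfy b - c = 79 - 68 = 11.
b_hypothetical, c_hypothetical = 19, 8
print("exact p-value:", mcnemar_exact(b_hypothetical, c_hypothetical))
print("chi-square statistic and p-value:", mcnemar_chi2(b_hypothetical, c_hypothetical))
```

Only the discordant questions enter the test. Questions that both chatbots answered correctly, or that both answered incorrectly, carry no information about which one performs better.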