ChatGPT goes to Operating Room: Evaluating GPT-4 Performance and the Future Direction of Surgical Education and Training in the Era of Large Language Models

Oh, N., Choi, G.-S., Lee, W. Y.
DOI: https://doi.org/10.1101/2023.03.16.23287340
2023-03-18
MedRxiv
Abstract: Purpose: This study aimed to assess the performance of ChatGPT, specifically the GPT-3.5 and GPT-4 models, on Korean general surgery board exam questions and to investigate the potential applications of large language models (LLMs) in surgical education and training. Method: The dataset comprised 280 questions from the Korean general surgery board exams conducted between 2020 and 2022. Both the GPT-3.5 and GPT-4 models were evaluated, and their performance was compared using the chi-square test. Result: GPT-3.5 achieved an overall accuracy of 46.8%, while GPT-4 achieved an overall accuracy of 76.4%, a significant improvement indicating a notable difference in performance between the models (p
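As a rough illustration of the comparison described in the abstract, the following minimal sketch runs a Pearson chi-square test on a 2x2 table of correct and incorrect answers. The counts are assumptions back-calculated by rounding the reported accuracies (46.8% and 76.4%) against 280 questions; the abstract does not give raw counts, and it does not say whether a continuity correction was applied, so none is used here. The function name `chi_square_2x2` is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns the statistic and the p-value for 1 degree of freedom.
    """
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = numerator / denominator
    # For 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


N_QUESTIONS = 280  # stated in the abstract

# Assumed counts, inferred from the reported percentages (not given in the source).
gpt35_correct = round(0.468 * N_QUESTIONS)
gpt4_correct = round(0.764 * N_QUESTIONS)

chi2, p_value = chi_square_2x2(
    gpt35_correct, N_QUESTIONS - gpt35_correct,
    gpt4_correct, N_QUESTIONS - gpt4_correct,
)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3g}")
```

Under these assumed counts the difference is far beyond conventional significance thresholds, which is consistent with the abstract's description of a significant improvement. The authors' exact statistic may differ if they used a correction or different groupings.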
What problem does this paper attempt to address?