Comparison of Artificial Intelligence to Resident Performance on Upper-Extremity Orthopaedic In-Training Examination Questions

Yagiz Ozdag, Daniel S Hayes, Gabriel S Makar, Shahid Manzar, Brian K Foster, Mason J Shultz, Joel C Klena, Louis C Grandizio
DOI: https://doi.org/10.1016/j.jhsg.2023.10.013
2023-12-11
Abstract:
Purpose: Few investigations have examined applications of artificial intelligence (AI) in upper-extremity (UE) surgical education. The purpose of this investigation was to assess the performance of a novel AI tool (ChatGPT) on UE questions from the Orthopaedic In-Training Examination (OITE) and to compare it with the examination performance of hand surgery residents.
Methods: We selected questions from the 2020-2022 OITEs in the hand and UE content domain and the shoulder and elbow content domain. These questions were divided into two categories: those with text-only prompts (text-only questions) and those that included supplementary images or videos (media questions). Two authors (B.K.F. and G.S.M.) converted the accompanying media into text-based descriptions. Included questions were entered into ChatGPT (version 3.5) to generate responses. Each OITE question was entered three times: (1) as an open-ended prompt requesting a free-text response; (2) as a multiple-choice prompt without a request for justification; and (3) as a multiple-choice prompt with a request for justification. We used the OITE scoring guide for each year to compare the percentage of correct AI responses with the percentage of correct resident responses.
Results: A total of 102 UE OITE questions were included; 59 were text-only questions, and 43 were media questions. ChatGPT correctly answered 46 (45%) of 102 questions using the Multiple Choice No Justification prompt requirement (42% for text-based and 44% for media questions). By comparison, postgraduate year 1 orthopaedic residents averaged 51% correct, and postgraduate year 5 residents answered 76% of the same questions correctly.
Conclusions: ChatGPT answered fewer UE OITE questions correctly than hand surgery residents at every training level.
Clinical relevance: Further development of novel AI tools may be necessary if this technology is going to have a role in UE education.