Abstract:

Background: Artificial intelligence has the potential to dramatically alter health care by enhancing how we diagnose and treat disease. One promising artificial intelligence model is ChatGPT, a general-purpose large language model trained by OpenAI. ChatGPT has shown human-level performance on several professional and academic benchmarks. We sought to evaluate its performance on surgical knowledge questions and assess the stability of this performance on repeat queries.

Methods: We evaluated the performance of ChatGPT-4 on questions from the Surgical Council on Resident Education question bank and a second commonly used surgical knowledge assessment, referred to as Data-B. Questions were entered in 2 formats: open-ended and multiple-choice. ChatGPT outputs were assessed for accuracy and insights by surgeon evaluators. We categorized reasons for model errors and the stability of performance on repeat queries.

Results: A total of 167 Surgical Council on Resident Education and 112 Data-B questions were presented to the ChatGPT interface. ChatGPT correctly answered 71.3% and 67.9% of multiple-choice and 47.9% and 66.1% of open-ended questions for Surgical Council on Resident Education and Data-B, respectively. For both open-ended and multiple-choice questions, approximately two-thirds of ChatGPT responses contained nonobvious insights. Common reasons for incorrect responses included inaccurate information in a complex question (n = 16, 36.4%), inaccurate information in a fact-based question (n = 11, 25.0%), and accurate information with circumstantial discrepancy (n = 6, 13.6%). Upon repeat query, the answer selected by ChatGPT varied for 36.4% of questions answered incorrectly on the first query; the response accuracy changed for 6/16 (37.5%) questions.

Conclusion: Consistent with findings in other academic and professional domains, we demonstrate near or above human-level performance of ChatGPT on surgical knowledge questions from 2 widely used question banks. ChatGPT performed better on multiple-choice than open-ended questions, prompting questions regarding its potential for clinical application. Unique to this study, we demonstrate inconsistency in ChatGPT responses on repeat queries. This finding warrants future consideration, including efforts to train large language models to provide the safe and consistent responses required for clinical application. Despite near or above human-level performance on question banks, and given these observations, it is unclear whether large language models such as ChatGPT are able to safely assist clinicians in providing care.
