Abstract: Background Artificial intelligence has the potential to dramatically alter health care by enhancing how we diagnose and treat disease. One promising artificial intelligence model is ChatGPT, a general-purpose large language model trained by OpenAI. ChatGPT has shown human-level performance on several professional and academic benchmarks. We sought to evaluate its performance on surgical knowledge questions and assess the stability of this performance on repeat queries.
Methods We evaluated the performance of ChatGPT-4 on questions from the Surgical Council on Resident Education question bank and a second commonly used surgical knowledge assessment, referred to as Data-B. Questions were entered in 2 formats: open-ended and multiple-choice. ChatGPT outputs were assessed for accuracy and insights by surgeon evaluators. We categorized reasons for model errors and assessed the stability of performance on repeat queries.
Results A total of 167 Surgical Council on Resident Education and 112 Data-B questions were presented to the ChatGPT interface. ChatGPT correctly answered 71.3% and 67.9% of multiple-choice and 47.9% and 66.1% of open-ended questions for Surgical Council on Resident Education and Data-B, respectively. For both open-ended and multiple-choice questions, approximately two-thirds of ChatGPT responses contained nonobvious insights. Common reasons for incorrect responses included inaccurate information in a complex question (n = 16, 36.4%), inaccurate information in a fact-based question (n = 11, 25.0%), and accurate information with circumstantial discrepancy (n = 6, 13.6%). Upon repeat query, the answer selected by ChatGPT varied for 36.4% of questions answered incorrectly on the first query; the response accuracy changed for 6/16 (37.5%) questions.
Conclusion Consistent with findings in other academic and professional domains, we demonstrate near or above human-level performance of ChatGPT on surgical knowledge questions from 2 widely used question banks. ChatGPT performed better on multiple-choice than on open-ended questions, raising questions regarding its potential for clinical application. Unique to this study, we demonstrate inconsistency in ChatGPT responses on repeat queries. This finding warrants future consideration, including efforts to train large language models to provide the safe and consistent responses required for clinical application. Given these observations, and despite near or above human-level performance on question banks, it remains unclear whether large language models such as ChatGPT can safely assist clinicians in providing care.
