Abstract: Background: Large language model (LLM)-based chatbots are evolving at an unprecedented pace with the release of ChatGPT, specifically GPT-3.5, and its successor, GPT-4. Their capabilities in general-purpose tasks and language generation have advanced to the point of performing excellently on various educational examination benchmarks, including medical knowledge tests. Comparing the performance of these 2 LLMs to that of Family Medicine residents on a multiple-choice medical knowledge test can provide insights into their potential as medical education tools. Objective: This study aimed to quantitatively and qualitatively compare the performance of GPT-3.5, GPT-4, and Family Medicine residents on a multiple-choice medical knowledge test appropriate for the level of a Family Medicine resident. Methods: An official University of Toronto Department of Family and Community Medicine Progress Test consisting of multiple-choice questions was entered into GPT-3.5 and GPT-4. The artificial intelligence chatbots' responses were manually reviewed to determine the selected answer, response length, response time, provision of a rationale for the outputted response, and the root cause of all incorrect responses (classified into arithmetic, logical, and information errors). The performance of the artificial intelligence chatbots was compared against that of a cohort of Family Medicine residents who concurrently attempted the test. Results: GPT-4 performed significantly better than GPT-3.5 (difference 25.0%, 95% CI 16.3%-32.8%; McNemar test: P<.001); it correctly answered 89/108 (82.4%) questions, while GPT-3.5 answered 62/108 (57.4%) questions correctly. Further, GPT-4 scored higher across all 11 categories of Family Medicine knowledge. In 86.1% (n=93) of its responses, GPT-4 provided a rationale for why other multiple-choice options were not chosen, compared with 16.7% (n=18) for GPT-3.5. Qualitatively, for both GPT-3.5 and GPT-4 responses, logical errors were the most common, while arithmetic errors were the least common. The average performance of Family Medicine residents was 56.9% (95% CI 56.2%-57.6%). The performance of GPT-3.5 was similar to that of the average Family Medicine resident (P=.16), while the performance of GPT-4 exceeded that of the top-performing Family Medicine resident (P<.001). Conclusions: GPT-4 significantly outperforms both GPT-3.5 and Family Medicine residents on a multiple-choice medical knowledge test designed for Family Medicine residents. GPT-4 provides a logical rationale for its response choice, ruling out other answer choices efficiently and with concise justification. Its high degree of accuracy and advanced reasoning capabilities support its potential applications in medical education, including the creation of exam questions and scenarios as well as serving as a resource for medical knowledge or information on community services.
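
The headline figures in the Results can be re-derived from the counts the abstract reports. The sketch below is illustrative only: it recomputes the percentages from 89/108, 62/108, 93/108, and 18/108, and shows how a paired comparison such as the McNemar test could be run in its exact (binomial) form. The abstract does not report the discordant-pair counts or which variant of the McNemar test was used, so the values `b=32` and `c=5` are hypothetical placeholders chosen only so that `b - c` matches the 27-question gap; the function names are likewise assumptions made for this example.

```python
from math import comb


def accuracy(correct: int, total: int) -> float:
    """Percentage of questions answered correctly."""
    return 100.0 * correct / total


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: questions the first model got right and the second got wrong
    c: questions the first model got wrong and the second got right
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


TOTAL = 108  # number of questions, from the abstract

gpt4 = accuracy(89, TOTAL)        # reported as 82.4%
gpt35 = accuracy(62, TOTAL)       # reported as 57.4%
diff = gpt4 - gpt35               # reported as 25.0%
gpt4_rationale = accuracy(93, TOTAL)   # reported as 86.1%
gpt35_rationale = accuracy(18, TOTAL)  # reported as 16.7%

print(f"GPT-4: {gpt4:.1f}%  GPT-3.5: {gpt35:.1f}%  difference: {diff:.1f}%")

# HYPOTHETICAL discordant counts (not reported in the abstract);
# only the constraint b - c = 89 - 62 = 27 comes from the source.
p_value = mcnemar_exact_p(b=32, c=5)
print(f"exact McNemar p (hypothetical counts): {p_value:.2e}")
```

The McNemar test uses only the questions on which the two chatbots disagree, which is why the per-question pairing matters: the same 108 questions were posed to both models, so an unpaired comparison of two proportions would discard that structure.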

Early identification of Family Medicine residents at risk of failure using Natural Language Processing and Explainable Artificial Intelligence

Harnessing Natural Language Processing to Support Decisions Around Workplace-Based Assessment: Machine Learning Study of Competency-Based Medical Education

Extracting Family History of Patients from Clinical Narratives: Exploring an End-to-End Solution with Deep Learning Models.

Using Natural Language Processing and Machine Learning to Identify Internal Medicine-Pediatrics Residency Values in Applications

Identifying relations of medications with adverse drug events using recurrent convolutional neural networks and gradient boosting

Development and Evaluation of Machine Learning Models for the Detection of Emergency Department Patients with Opioid Misuse from Clinical Notes

Explainable Machine Learning Prediction for the Academic Performance of Deaf Scholars

Course Success Prediction and Early Identification of At-Risk Students Using Explainable Artificial Intelligence

Early prediction of medical students' performance in high-stakes examinations using machine learning approaches

Using Natural Language Processing to Screen Patients with Active Heart Failure: An Exploration for Hospital-wide Surveillance

Automated Identification of Heart Failure With Reduced Ejection Fraction Using Deep Learning-Based Natural Language Processing

Detecting of a Patient's Condition From Clinical Narratives Using Natural Language Representation

Assessment of Resident and AI Chatbot Performance on the University of Toronto Family Medicine Residency Progress Test: Comparative Study

Leveraging Large Language Models through Natural Language Processing to provide interpretable Machine Learning predictions of mental deterioration in real time

Application of natural language processing to identify social needs from patient medical notes: development and assessment of a scalable, performant, and rule-based model in an integrated healthcare delivery system

Assessment of ChatGPT-4 in Family Medicine Board Examinations Using Advanced AI Learning and Analytical Methods: Observational Study

Artificial intelligence-based pathologic myopia identification system in the ophthalmology residency training program

Predicting students' academic progress and related attributes in first-year medical students: an analysis with artificial neural networks and Naïve Bayes

Automated Scoring of Clinical Patient Notes using Advanced NLP and Pseudo Labeling

Early Prediction of 30-day ICU Re-admissions Using Natural Language Processing and Machine Learning