Abstract e13585

Background: The integration of Large Language Models (LLMs) into healthcare and medical education represents a significant paradigm shift, with the potential to transform how medical knowledge is accessed and assimilated. These models, however, have not yet been systematically trained, tested, or validated on complex medical information such as sub-specialty medical examinations. This study explores the performance of seven major LLMs in clinical radiation oncology using a residency in-training exam.

Methods: The 2021 American College of Radiology (ACR) Radiation Oncology In-Training Exam (TXIT) was used to evaluate seven LLMs: OpenAI's GPT-3.5-turbo, GPT-4, and GPT-4-turbo; three of Meta's Llama-2 models (7 billion, 13 billion, and 70 billion parameters); and Google's PaLM-2-text-bison. The ACR provided the publicly available national scoring data for this exam. The exam comprised 298 questions across 13 domains, including clinical radiation oncology (195 questions, 65.4%). The exam was submitted to each LLM through an application programming interface. LLM-generated answers were analyzed by clinical disease site and compared with radiation oncology trainee performance, stratified by Post-Graduate Year (PGY) 2-5.

Results: LLM performance varied in the overall clinical radiation oncology domain. OpenAI's GPT-4-turbo performed best with 68.0% correct answers, followed by GPT-4 (61.0%), GPT-3.5-turbo (48.0%), PaLM-2-text-bison (40.0%), and the three Llama-2 models (70b 37.0%, 13b 38.0%, 7b 26.0%). GPT-4-turbo outperformed lower-level trainees (PGY2 51.6%, PGY3 61.6%) and performed comparably to upper-level radiation oncology trainees (PGY4 64.1%, PGY5 68.3%). Notably, GPT-4-turbo showed a 7.0 percentage-point improvement over its predecessor, GPT-4. LLMs scored lowest in the gastrointestinal, genitourinary, and gynecology domains and highest in the bone and soft tissue; central nervous system and eye; and head, neck, and skin domains.

Conclusions: GPT-4-turbo demonstrates clinical accuracy comparable to that of upper-level trainees and superior to that of lower-level trainees in nearly all clinical domains. Conversely, the Llama-2 foundation models performed worse overall than Level 1 (PGY2) trainees. Score discrepancies across disease site domains may be due to data availability, complexity of medical conditions, quality and quantity of training datasets, and interdisciplinary data inputs. Future research will need to evaluate the performance of models that are fine-tuned in clinical oncology. This study also underscores the need for rigorous validation of LLM-generated information against established medical literature and expert consensus, which requires expert oversight when LLMs are applied in medical education and practice.
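
The comparison between model and trainee scores reported above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the authors' analysis code: the scores are copied from this abstract, while the dictionary keys and the helper names `highest_level_matched` and `point_gap` are hypothetical, introduced here only for illustration. "Matched" is taken to mean that a model's score is at least the trainee cohort's score, which is a simplifying assumption and not a statistical test.

```python
# Percent correct in the clinical radiation oncology domain, as reported in the abstract.
# The key names are illustrative labels, not official model identifiers.
model_scores = {
    "GPT-4-turbo": 68.0,
    "GPT-4": 61.0,
    "GPT-3.5-turbo": 48.0,
    "PaLM-2-text-bison": 40.0,
    "Llama-2-70b": 37.0,
    "Llama-2-13b": 38.0,
    "Llama-2-7b": 26.0,
}

# National trainee scores by post-graduate year, as reported in the abstract.
trainee_scores = {"PGY2": 51.6, "PGY3": 61.6, "PGY4": 64.1, "PGY5": 68.3}


def highest_level_matched(score, trainees):
    """Return the highest-scoring trainee level whose score the model meets
    or exceeds, or None if the model is below every level (assumed criterion)."""
    matched = None
    # Walk trainee levels from lowest to highest score.
    for level, trainee_score in sorted(trainees.items(), key=lambda kv: kv[1]):
        if score >= trainee_score:
            matched = level
    return matched


def point_gap(score_a, score_b):
    """Absolute difference in percentage points between two scores."""
    return round(score_a - score_b, 1)


if __name__ == "__main__":
    for name, score in model_scores.items():
        level = highest_level_matched(score, trainee_scores)
        print(f"{name}: {score:.1f}% correct, highest level matched: {level}")
    gap = point_gap(model_scores["GPT-4-turbo"], model_scores["GPT-4"])
    print(f"GPT-4-turbo vs GPT-4: {gap} percentage points")
```

Under this criterion GPT-4-turbo meets the PGY4 score but falls just short of PGY5 (68.0% versus 68.3%), and none of the Llama-2 models reach the PGY2 score, in line with the conclusions stated in the abstract.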
