Abstract e13585

Background: The integration of Large Language Models (LLMs) into healthcare and medical education represents a significant paradigm shift, with the potential to transform how medical knowledge is accessed and assimilated. These models, however, have not yet been systematically trained, tested, or validated on complex medical information such as sub-specialty medical examinations. This study explores the performance of seven major LLMs in clinical radiation oncology using residency in-training exams.

Methods: The 2021 American College of Radiology (ACR) Radiation Oncology In-Training Exam (TXIT) was used to evaluate the performance of seven LLMs: OpenAI's GPT-3.5-turbo, GPT-4, and GPT-4-turbo; three of Meta's Llama-2 models (7 billion, 13 billion, and 70 billion parameters); and Google's PaLM-2-text-bison. The ACR provided the publicly available national scoring for this exam. The exam comprised 298 questions across 13 domains, including clinical radiation oncology (195 questions, 65.4%). The exam was processed through each LLM via an application programming interface. LLM-generated answers were analyzed by clinical disease site and compared with radiation oncology trainee performance stratified by Post-Graduate Year (PGY) 2-5.

Results: LLM performance varied in the overall clinical radiation oncology domain. OpenAI's GPT-4-turbo performed best with 68.0% correct answers, followed by GPT-4 (61.0%), GPT-3.5-turbo (48.0%), PaLM-2-text-bison (40.0%), and the three Llama-2 models (70b 37.0%, 13b 38%, 7b 26%). GPT-4-turbo outperformed lower-level trainees (PGY2 51.6%, PGY3 61.6%) and performed comparably to upper-level radiation oncology trainees (PGY4 64.1%, PGY5 68.3%). Notably, GPT-4-turbo scored 7.0 percentage points higher than its predecessor GPT-4. LLMs scored lowest in the gastrointestinal, genitourinary, and gynecology domains and highest in the bone and soft tissue, central nervous system and eye, and head, neck, and skin domains.

Conclusions: GPT-4-turbo demonstrates clinical accuracy comparable to upper-level trainees and superior to lower-level trainees in nearly all clinical domains. Conversely, the Llama-2 foundation models perform worse overall than Level 1 (PGY2) trainees. Score discrepancies across disease site domains may be due to data availability, complexity of medical conditions, quality and quantity of training datasets, and interdisciplinary data inputs. Future research will need to evaluate the performance of models fine-tuned in clinical oncology. This study also underscores the need for rigorous validation of LLM-generated information against established medical literature and expert consensus, and for expert oversight when LLMs are applied in medical education and practice.
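The scoring arithmetic behind these comparisons can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the helper names (`accuracy_by_domain`, `trainee_levels_matched`) and the toy graded answers are assumptions made for illustration, and only the percentages in the two dictionaries are taken from the abstract.

```python
from collections import defaultdict

# Overall clinical-domain accuracy (percent correct) as reported in the abstract.
LLM_SCORES = {
    "GPT-4-turbo": 68.0,
    "GPT-4": 61.0,
    "GPT-3.5-turbo": 48.0,
    "PaLM-2-text-bison": 40.0,
    "Llama-2-70b": 37.0,
    "Llama-2-13b": 38.0,
    "Llama-2-7b": 26.0,
}
TRAINEE_SCORES = {"PGY2": 51.6, "PGY3": 61.6, "PGY4": 64.1, "PGY5": 68.3}


def accuracy_by_domain(graded):
    """Hypothetical helper: (domain, is_correct) pairs -> {domain: percent correct}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for domain, is_correct in graded:
        totals[domain] += 1
        correct[domain] += bool(is_correct)
    return {d: 100.0 * correct[d] / totals[d] for d in totals}


def trainee_levels_matched(score):
    """Hypothetical helper: PGY levels whose reported mean the score meets or exceeds."""
    return [level for level, mean in TRAINEE_SCORES.items() if score >= mean]


if __name__ == "__main__":
    # Toy graded answers (illustrative only, not exam data).
    toy = [("gastrointestinal", True), ("gastrointestinal", False), ("gynecology", True)]
    print(accuracy_by_domain(toy))

    # Share of the 298 exam questions in the clinical domain.
    print(round(100 * 195 / 298, 1))

    # Percentage-point gap between GPT-4-turbo and GPT-4.
    print(round(LLM_SCORES["GPT-4-turbo"] - LLM_SCORES["GPT-4"], 1))

    for model, score in LLM_SCORES.items():
        print(model, trainee_levels_matched(score))
```

A strict "meets or exceeds" threshold is a simplification: the abstract describes GPT-4-turbo (68.0%) as comparable to PGY5 trainees (68.3%), a judgment that a hard cutoff does not capture.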
