Abstract e13585

Background: The integration of Large Language Models (LLMs) into healthcare and medical education represents a significant paradigm shift, with the potential to transform how medical knowledge is accessed and assimilated. These models, however, have not yet been systematically trained, tested, or validated on complex medical information such as sub-specialty medical examinations. This study evaluates the performance of seven major LLMs in clinical radiation oncology using a residency in-training exam.

Methods: The 2021 American College of Radiology (ACR) Radiation Oncology In-Training Exam (TXIT) was used to evaluate seven LLMs: OpenAI's GPT-3.5-turbo, GPT-4, and GPT-4-turbo; three of Meta's Llama-2 models (7 billion, 13 billion, and 70 billion parameters); and Google's PaLM-2-text-bison. National trainee scoring for this exam is publicly available from the ACR. The exam comprised 298 questions across 13 domains, including clinical radiation oncology (195 questions, 65.4%). The exam was submitted to each LLM through an application programming interface. LLM-generated answers were analyzed by clinical disease site and compared with radiation oncology trainee performance, stratified by Post-Graduate Year (PGY) 2-5.

Results: LLM performance varied widely in the overall clinical radiation oncology domain. OpenAI's GPT-4-turbo performed best with 68.0% correct answers, followed by GPT-4 (61.0%), GPT-3.5-turbo (48.0%), PaLM-2-text-bison (40.0%), and the three Llama-2 models (70b 37.0%, 13b 38%, 7b 26%). GPT-4-turbo outperformed lower-level trainees (PGY2 51.6%, PGY3 61.6%) and performed comparably to upper-level radiation oncology trainees (PGY4 64.1%, PGY5 68.3%). Notably, GPT-4-turbo scored 7.0 percentage points higher than its predecessor, GPT-4. LLMs scored lowest in the gastrointestinal, genitourinary, and gynecology domains and highest in the bone and soft tissue; central nervous system and eye; and head, neck, and skin domains.

Conclusions: GPT-4-turbo demonstrates clinical accuracy comparable to upper-level trainees and superior to lower-level trainees in nearly all clinical domains. Conversely, the Llama-2 foundation models perform worse overall than Level 1 (PGY2) trainees. Score discrepancies across disease-site domains may reflect differences in data availability, the complexity of the medical conditions, the quality and quantity of training datasets, and interdisciplinary data inputs. Future research will need to evaluate models that are fine-tuned on clinical oncology data. This study also underscores the need for rigorous validation of LLM-generated information against established medical literature and expert consensus, and for expert oversight of LLM use in medical education and practice.
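The scoring and comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the `(domain, model_answer, correct_answer)` record layout, and the case-insensitive letter matching are assumptions made for illustration. The percentages are the overall clinical-domain figures quoted in the abstract.

```python
from collections import defaultdict

# Overall clinical-domain accuracy (%) as quoted in the abstract.
LLM_ACCURACY = {
    "GPT-4-turbo": 68.0,
    "GPT-4": 61.0,
    "GPT-3.5-turbo": 48.0,
    "PaLM-2-text-bison": 40.0,
    "Llama-2-13b": 38.0,
    "Llama-2-70b": 37.0,
    "Llama-2-7b": 26.0,
}
TRAINEE_ACCURACY = {"PGY2": 51.6, "PGY3": 61.6, "PGY4": 64.1, "PGY5": 68.3}


def accuracy_by_domain(records):
    """Percent correct per disease-site domain.

    `records` is a hypothetical layout: an iterable of
    (domain, model_answer, correct_answer) tuples, where answers are
    multiple-choice letters compared case-insensitively.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for domain, given, key in records:
        total[domain] += 1
        correct[domain] += int(given.strip().upper() == key.strip().upper())
    return {d: 100.0 * correct[d] / total[d] for d in total}


def trainee_levels_outscored(model_pct):
    """Trainee levels whose reported score is strictly below `model_pct`."""
    return [lvl for lvl, pct in TRAINEE_ACCURACY.items() if model_pct > pct]


if __name__ == "__main__":
    for model, pct in LLM_ACCURACY.items():
        print(model, pct, trainee_levels_outscored(pct))
```

The comparison is strict, so a score that merely approaches a trainee level (GPT-4-turbo at 68.0% against PGY5 at 68.3%) is not counted as outscoring it. A real analysis would also need a significance test, which this sketch omits.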

Performance of Large Language Models in Technical MRI Question Answering: A Comparative Study

Evaluating AI Proficiency in Nuclear Cardiology: Large Language Models take on the Board Preparation Exam

Performance of Large Language Models on a Neurology Board-Style Examination

Performance of an Open-Source Large Language Model in Extracting Information from Free-Text Radiology Reports

A Comparative Study of Open-Source Large Language Models, GPT-4 and Claude 2: Multiple-Choice Test Taking in Nephrology

Large language foundation models encode clinical radiation oncology domain knowledge: Performance on the American College of Radiology Standardized Examination.

Nimg-67. The Current State Of Large Language Models And Neuro-Oncology Imaging

Performance of Open-Source LLMs in Challenging Radiological Cases — A Benchmark Study on 1,933 Eurorad Case Reports

Can large language models reason about medical questions?

Evaluating Large Language Models in Ophthalmology

O87: Stratified Evaluation of Large Language Model GPT-4’s Question-Answering In Surgery reveals AI Knowledge Gaps

Human-AI Collaboration in Large Language Model-Assisted Brain MRI Differential Diagnosis: A Usability Study

Performance of large language models at the MRCS Part A: a tool for medical education?

The Diagnostic Performance of Large Language Models and General Radiologists in Thoracic Radiology Cases: A Comparative Study

Performance of large language models in numerical vs. semantic medical knowledge: Benchmarking on evidence-based Q&As

Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data

Large language models (LLMs) in radiology exams for medical students: Performance and consequences

Evaluation of large language models performance against humans for summarizing MRI knee radiology reports: A feasibility study

Evaluating the strengths and weaknesses of large language models in answering neurophysiology questions