Abstract: We present the first study to investigate Large Language Models (LLMs) in answering radiation oncology physics questions. Because popular exams like AP Physics, LSAT, and GRE have large test-taker populations and ample test preparation resources in circulation, they may not allow for accurately assessing the true potential of LLMs. This paper proposes evaluating LLMs on a highly specialized topic, radiation oncology physics, which may be more pertinent to scientific and medical communities in addition to being a valuable benchmark of LLMs. We developed an exam consisting of 100 radiation oncology physics questions based on our expertise at Mayo Clinic. Four LLMs, ChatGPT (GPT-3.5), ChatGPT (GPT-4), Bard (LaMDA), and BLOOMZ, were evaluated against medical physicists and non-experts. On average, ChatGPT (GPT-4) outperformed all other LLMs as well as the medical physicists. The performance of ChatGPT (GPT-4) improved further when it was prompted to explain first, then answer. ChatGPT (GPT-3.5 and GPT-4) showed a high level of consistency in its answer choices across a number of trials, whether correct or incorrect, a characteristic that was not observed in the human test groups. In evaluating ChatGPT's (GPT-4) deductive reasoning ability using a novel approach (substituting the correct answer with "None of the above choices is the correct answer."), ChatGPT (GPT-4) demonstrated surprising accuracy, suggesting the potential presence of an emergent ability. Finally, although ChatGPT (GPT-4) performed well overall, its intrinsic properties did not allow for further improvement when scoring based on a majority vote across trials. In contrast, a team of medical physicists was able to greatly outperform ChatGPT (GPT-4) using a majority vote. This study suggests a great potential for LLMs to work alongside radiation oncology experts as highly knowledgeable assistants.
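
The contrast between average scoring and majority-vote scoring described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the four-question answer key, and the toy answer sheets are hypothetical and are not data or code from the study. It shows why a test taker that repeats the same answers across trials gains nothing from a majority vote, while a group whose errors fall on different questions can.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common choice for one question.

    Ties go to the choice that appears first (Counter keeps insertion order).
    """
    return Counter(answers).most_common(1)[0][0]


def mean_trial_score(trials, answer_key):
    """Average fraction correct over individual trials (or individual test takers)."""
    per_trial = [
        sum(a == k for a, k in zip(trial, answer_key)) / len(answer_key)
        for trial in trials
    ]
    return sum(per_trial) / len(per_trial)


def majority_vote_score(trials, answer_key):
    """Fraction correct after taking a per-question majority vote across trials."""
    voted = [majority_vote(list(choices)) for choices in zip(*trials)]
    return sum(v == k for v, k in zip(voted, answer_key)) / len(answer_key)


# Hypothetical toy data, NOT taken from the study.
answer_key = ["A", "B", "C", "D"]

# A highly consistent test taker: the same answers (and the same error) every trial.
consistent_trials = [["A", "B", "C", "A"]] * 3

# A team whose members each miss a different question.
team_trials = [
    ["A", "B", "C", "A"],
    ["A", "B", "D", "D"],
    ["A", "C", "C", "D"],
]

consistent_mean = mean_trial_score(consistent_trials, answer_key)
consistent_vote = majority_vote_score(consistent_trials, answer_key)
team_mean = mean_trial_score(team_trials, answer_key)
team_vote = majority_vote_score(team_trials, answer_key)

print(consistent_mean, consistent_vote)  # identical: voting cannot fix a repeated error
print(team_mean, team_vote)              # the vote outscores the individual average
```

In this toy setup the consistent test taker scores the same either way, because its one error is repeated in every trial, whereas the team's individual errors are outvoted on each question.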

Efficiently Measuring the Cognitive Ability of LLMs: an Adaptive Testing Perspective

CogLM: Tracking Cognitive Development of Large Language Models

LLM4DS: Evaluating Large Language Models for Data Science Code Generation

TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for Human-Aligned LLMs

Beyond ChatGPT: Enhancing Software Quality Assurance Tasks with Diverse LLMs and Validation Techniques

Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench

Enhancing Large Language Model Performance To Answer Questions and Extract Information More Accurately

How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments

Evaluating Large Language Models on a Highly-specialized Topic, Radiation Oncology Physics

From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation

LatEval: An Interactive LLMs Evaluation Benchmark with Incomplete Information from Lateral Thinking Puzzles

CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy

AGIBench: A Multi-granularity, Multimodal, Human-referenced, Auto-scoring Benchmark for Large Language Models

Towards Explainable Computerized Adaptive Testing with Large Language Model

Evaluating the Performance of Large Language Models on GAOKAO Benchmark

CogGPT: Unleashing the Power of Cognitive Dynamics on Large Language Models

What is the best model? Application-driven Evaluation for Large Language Models

Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks

Applying Large Language Models and Chain-of-Thought for Automatic Scoring

Harnessing LLMs for multi-dimensional writing assessment: Reliability and alignment with human judgments

Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition