Abstract:
Importance: Recent advancements in large language models (LLMs) have shown potential in a wide array of applications, including health care. While LLMs showed heterogeneous results across specialized medical board examinations, the performance of these models in neurology board examinations remains unexplored.
Objective: To assess the performance of LLMs on neurology board-style examinations.
Design, setting, and participants: This cross-sectional study was conducted between May 17 and May 31, 2023. The evaluation utilized a question bank approved by the American Board of Psychiatry and Neurology and was validated with a small question cohort by the European Board for Neurology. All questions were categorized into lower-order (recall, understanding) and higher-order (apply, analyze, synthesize) questions based on the Bloom taxonomy for learning and assessment. Performance by LLM ChatGPT versions 3.5 (LLM 1) and 4 (LLM 2) was assessed in relation to overall scores, question type, and topics, along with the confidence level and reproducibility of answers.
Main outcomes and measures: Overall percentage scores of 2 LLMs.
Results: LLM 2 significantly outperformed LLM 1 by correctly answering 1662 of 1956 questions (85.0%) vs 1306 questions (66.8%) for LLM 1. Notably, LLM 2's performance was greater than the mean human score of 73.8%, effectively achieving near-passing and passing grades in the neurology board examination. LLM 2 outperformed human users in behavioral, cognitive, and psychological-related questions and demonstrated superior performance to LLM 1 in 6 categories. Both LLMs performed better on lower-order than higher-order questions, with LLM 2 excelling in both lower-order and higher-order questions. Both models consistently used confident language, even when providing incorrect answers. Reproducible answers of both LLMs were associated with a higher percentage of correct answers than inconsistent answers.
Conclusions and relevance: Despite the absence of neurology-specific training, LLM 2 demonstrated commendable performance, whereas LLM 1 performed slightly below the human average. While higher-order cognitive tasks were more challenging for both models, LLM 2's results were equivalent to passing grades in specialized neurology examinations. These findings suggest that LLMs could have significant applications in clinical neurology and health care with further refinements.
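
The overall percentage scores quoted in the Results follow directly from the raw counts. As a minimal sketch in Python, using only the figures stated in the abstract (the names `percent_correct`, `CORRECT`, and `scores` are illustrative and not taken from the study), the arithmetic can be checked as follows:

```python
# Figures copied from the Results paragraph of the abstract.
TOTAL_QUESTIONS = 1956
CORRECT = {"LLM 1": 1306, "LLM 2": 1662}
MEAN_HUMAN_SCORE = 73.8  # percent, as reported in the abstract


def percent_correct(correct: int, total: int) -> float:
    """Return the share of correct answers as a percentage, to one decimal."""
    return round(100 * correct / total, 1)


# Overall percentage score for each model (hypothetical helper names).
scores = {name: percent_correct(n, TOTAL_QUESTIONS) for name, n in CORRECT.items()}

for name, score in scores.items():
    # Difference from the mean human score, in percentage points.
    gap = round(score - MEAN_HUMAN_SCORE, 1)
    print(f"{name}: {score}% ({gap:+} percentage points vs. mean human score)")
```

This only reproduces the rounding of the headline scores; it does not attempt the statistical comparison behind "significantly outperformed", which the abstract does not describe.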

Evaluating the Performance of Large Language Models in Competitive Programming: A Multi-Year, Multi-Grade Analysis

Will code one day run a code? Performance of language models on ACEM primary examinations and implications

A Careful Examination of Large Language Model Performance on Grade School Arithmetic

Competition-Level Problems are Effective LLM Evaluators

Evaluating Language Models for Generating and Judging Programming Feedback

Evaluating Large Language Models with Grid-Based Game Competitions: An Extensible LLM Benchmark and Leaderboard

Performance of Large Language Models in a Computer Science Degree Program

An evaluation of LLM code generation capabilities through graded exercises

LLM4DS: Evaluating Large Language Models for Data Science Code Generation

CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models

Can Language Models Solve Olympiad Programming?

CLR-Bench: Evaluating Large Language Models in College-level Reasoning

L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models

Benchmarking Large Language Model (LLM) Performance for Game Playing via Tic-Tac-Toe

A Performance Study of LLM-Generated Code on Leetcode

Prompting Large Language Models to Tackle the Full Software Development Lifecycle: A Case Study

Evaluating the Performance of Large Language Models via Debates

Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games

Exploring the Responses of Large Language Models to Beginner Programmers' Help Requests

Performance of Large Language Models on a Neurology Board-Style Examination

Can Language Models Rival Mathematics Students? Evaluating Mathematical Reasoning through Textual Manipulation and Human Experiments