Evaluating the strengths and weaknesses of large language models in answering neurophysiology questions

Hassan Shojaee-Mend, Reza Mohebbati, Mostafa Amiri, Alireza Atarodi

DOI: https://doi.org/10.1038/s41598-024-60405-y

IF: 4.6

2024-05-12

Scientific Reports

Abstract: Large language models (LLMs), like ChatGPT, Google's Bard, and Anthropic's Claude, showcase remarkable natural language processing capabilities. Evaluating their proficiency in specialized domains such as neurophysiology is crucial in understanding their utility in research, education, and clinical applications. This study aims to assess and compare the effectiveness of LLMs in answering neurophysiology questions in both English and Persian (Farsi), covering a range of topics and cognitive levels. Twenty questions covering four topics (general, sensory system, motor system, and integrative) and two cognitive levels (lower-order and higher-order) were posed to the LLMs. Physiologists scored the essay-style answers on a scale of 0–5 points. Statistical analysis compared the scores across factors such as model, language, topic, and cognitive level. Qualitative analysis identified reasoning gaps. In general, the models demonstrated good performance (mean score = 3.87/5), with no significant difference between languages or cognitive levels. Performance was strongest in the motor system (mean = 4.41) and weakest in integrative topics (mean = 3.35). Detailed qualitative analysis uncovered deficiencies in reasoning, discerning priorities, and knowledge integration. This study offers valuable insights into LLMs' capabilities and limitations in the field of neurophysiology. The models demonstrate proficiency in general questions but face challenges in advanced reasoning and knowledge integration. Targeted training could address gaps in knowledge and causal reasoning. As LLMs evolve, rigorous domain-specific assessments will be crucial for evaluating advancements in their performance.

multidisciplinary sciences

What problem does this paper attempt to address?

This paper aims to evaluate the effectiveness and limitations of large language models (LLMs) in answering neurophysiology questions. Specifically, the researchers selected three popular LLMs (ChatGPT, Google's Bard, and Anthropic's Claude) and tested their ability to answer neurophysiology questions in English and Persian (Farsi). The questions cover four topics (general, sensory system, motor system, and integrative) and two cognitive levels (lower-order and higher-order), classified using Bloom's taxonomy. The main objectives of the study include:

1. **Evaluating the overall performance of the models**: The three models answered a series of neurophysiology questions, and physiologists scored the quality of the answers to gauge how well the models handle questions in a specialized field.
2. **Comparing the differences between models**: Analyzing how performance differs between models on the same questions, in order to understand which models perform better on specific types of questions.
3. **Exploring the influence of language and cognitive level**: Evaluating how language (English vs. Persian) and cognitive level (lower-order vs. higher-order) affect model performance.
4. **Identifying the weaknesses of the models**: Using qualitative analysis to find problems such as insufficient logical reasoning, misjudgment of priorities, and poor knowledge integration in the models' answers to certain questions.

Overall, this study provides insights into the effectiveness of large language models in the field of neurophysiology and points out their limitations in higher-order reasoning and knowledge integration, suggesting directions for future research and model improvement.
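The comparison of scores across model, language, topic, and cognitive level can be illustrated with a minimal Python sketch. It only shows the descriptive step (mean score per factor level); it is not the authors' analysis code and does not reproduce their statistical tests. The record layout, the function name `mean_score_by`, the model labels, and every score below are hypothetical placeholders, not data from the study.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical placeholder records, NOT data from the study.
# Layout (an assumption): (model, language, topic, cognitive_level, score),
# where score is an expert rating on the 0-5 scale described in the abstract.
FIELDS = ("model", "language", "topic", "cognitive_level")

records = [
    ("model_a", "English", "motor", "lower", 5),
    ("model_a", "Persian", "motor", "higher", 4),
    ("model_b", "English", "integrative", "higher", 3),
    ("model_b", "Persian", "integrative", "lower", 4),
    ("model_c", "English", "sensory", "lower", 4),
    ("model_c", "Persian", "general", "higher", 2),
]


def mean_score_by(rows, factor):
    """Return {factor level: mean score} for one grouping factor."""
    idx = FIELDS.index(factor)  # raises ValueError for an unknown factor
    groups = defaultdict(list)
    for row in rows:
        score = row[-1]
        if not 0 <= score <= 5:
            raise ValueError(f"score {score} is outside the 0-5 scale")
        groups[row[idx]].append(score)
    return {level: mean(scores) for level, scores in groups.items()}


if __name__ == "__main__":
    for factor in FIELDS:
        print(factor, mean_score_by(records, factor))
```

Grouping by one factor at a time mirrors the way the abstract reports results (for example, a mean per topic). A real analysis would also need an inferential test suited to ordinal ratings, which this sketch leaves out.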

Performance of Large Language Models on a Neurology Board-Style Examination

Performance of Large Language Models (ChatGPT, Bing Search, and Google Bard) in Solving Case Vignettes in Physiology

Evaluation of large language model performance on the Biomedical Language Understanding and Reasoning Benchmark

Evaluating the Potential of Leading Large Language Models in Reasoning Biology Questions

SemioLLM: Assessing Large Language Models for Semiological Analysis in Epilepsy Research

Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data

Can large language models reason about medical questions?

Evaluating Large Language Models on a Highly-specialized Topic, Radiation Oncology Physics

Large language models encode clinical knowledge

Evaluating Large Language Models in Ophthalmology

Large Language Models in Pathology: A Comparative Study on Multiple Choice Question Performance with Pathology Trainees

Do Large Language Models have Shared Weaknesses in Medical Question Answering?

Large language models in pathology: A comparative study of ChatGPT and bard with pathology trainees on multiple-choice questions

Evaluating large language models on medical, lay language, and self-reported descriptions of genetic conditions

Quality of Answers of Generative Large Language Models vs Peer Patients for Interpreting Lab Test Results for Lay Patients: Evaluation Study

The Pulse of Artificial Intelligence in Cardiology: A Comprehensive Evaluation of State-of-the-art Large Language Models for Potential Use in Clinical Cardiology

Comparing the Performance of Popular Large Language Models on the National Board of Medical Examiners Sample Questions

Performance of large language models at the MRCS Part A: a tool for medical education?

A Comprehensive Evaluation of Large Language Models on Mental Illnesses

Evaluation of General Large Language Models in Contextually Assessing Semantic Concepts Extracted from Adult Critical Care Electronic Health Record Notes