In-depth analysis of ChatGPT's performance based on specific signaling words and phrases in the question stem of 2377 USMLE step 1 style questions

Leonard Knoedler, Samuel Knoedler, Cosima C. Hoch, Lukas Prantl, Konstantin Frank, Laura Soiderer, Sebastian Cotofana, Amir H. Dorafshar, Thilo Schenck, Felix Vollbach, Giuseppe Sofo, Michael Alfertshofer

DOI: https://doi.org/10.1038/s41598-024-63997-7

IF: 4.6

2024-06-13

Scientific Reports

Abstract: ChatGPT has garnered attention as a multifaceted AI chatbot with potential applications in medicine. Despite intriguing preliminary findings in areas such as clinical management and patient education, there remains a substantial knowledge gap regarding the potential and limitations of ChatGPT's capabilities, especially in medical test-taking and education. A total of n = 2,729 USMLE Step 1 practice questions were extracted from the Amboss question bank. After excluding 352 image-based questions, the remaining 2,377 text-based questions were further categorized and entered manually into ChatGPT, and its responses were recorded. ChatGPT's overall performance was analyzed based on question difficulty, category, and content with regard to specific signal words and phrases. ChatGPT achieved an overall accuracy rate of 55.8% across the n = 2,377 questions. Its performance showed a significant inverse correlation with question difficulty (r_s = -0.306; p < 0.001), while its accuracy remained comparable to that of the human user peer group across different levels of question difficulty. Notably, ChatGPT performed better on serology-related questions (61.1% vs. 53.8%; p = 0.005) but struggled with ECG-related content (42.9% vs. 55.6%; p = 0.021). ChatGPT performed statistically significantly worse on pathophysiology-related question stems (signal phrase: "what is the most likely/probable cause"). ChatGPT performed consistently across various question categories and difficulty levels. These findings emphasize the need for further investigations to explore the potential and limitations of ChatGPT in medical examination and education.

multidisciplinary sciences

What problem does this paper attempt to address?

The paper evaluates the performance of the AI chatbot ChatGPT on USMLE (United States Medical Licensing Examination) Step 1 style questions. Specifically, the researchers extracted 2,729 USMLE Step 1 practice questions from the Amboss question bank, excluded 352 image-based questions, and had ChatGPT answer the remaining 2,377 text-based questions. They then analyzed its overall accuracy, its performance at different difficulty levels, and how its performance varied with specific signal words and phrases in the question stem. The main objectives and the corresponding findings are:

1. **Evaluating ChatGPT's overall performance on USMLE Step 1 questions**: ChatGPT's overall accuracy was 55.8%, close to but below the 60% passing threshold.
2. **Analyzing performance at different difficulty levels**: ChatGPT's performance was significantly negatively correlated with question difficulty (r_s = -0.306; p < 0.001), meaning that the more difficult the question, the worse the performance.
3. **Exploring the impact of specific signal words and phrases**: ChatGPT performed better on serology-related questions (61.1% vs. 53.8%) but worse on electrocardiogram (ECG)-related questions (42.9% vs. 55.6%).
4. **Comparing performance across different medical specialties**: ChatGPT performed best in the behavioral health category (77.9%) and worst in the cardiovascular system category (44.7%).
5. **Investigating the impact of patient age group on performance**: ChatGPT performed better on questions involving younger patient groups than on those involving older patient groups.

Through these analyses, the study aims to gain a deeper understanding of ChatGPT's potential and limitations in medical exams and education and to provide a reference for future research.
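The reported inverse relationship between question difficulty and accuracy (r_s = -0.306) is presumably a Spearman rank correlation, which is the usual meaning of the symbol r_s. The following is a minimal sketch of how such a coefficient can be computed, not the authors' analysis code. The integer difficulty coding, the ten toy observations, and all function names are hypothetical and chosen only for illustration; they are not data from the study.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's r_s: the Pearson correlation of the (tie-averaged) ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data, NOT from the paper:
# difficulty level per question (assumed integer coding) and
# whether the answer was correct (1) or incorrect (0).
difficulty = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
correct = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

r_s = spearman(difficulty, correct)
print(f"r_s = {r_s:.3f}")  # negative: harder questions are answered correctly less often
```

A negative coefficient, as in the study, indicates that accuracy tends to fall as difficulty rises; the significance test behind the reported p-value is not covered by this sketch.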

Performance of ChatGPT on the MCAT: The Road to Personalized and Equitable Premedical Learning

How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment

How Well Does ChatGPT Do When Taking the Medical Licensing Exams? The Implications of Large Language Models for Medical Education and Knowledge Assessment

Assessing ChatGPT 4.0's test performance and clinical diagnostic accuracy on USMLE STEP 2 CK and clinical case reports

Evaluating the Performance of ChatGPT-4o Vision Capabilities on Image-Based USMLE Step 1, Step 2, and Step 3 Examination Questions

Evaluation of ChatGPT's performance in Medical Education: A Comparative Analysis with Students in a Pulmonology Examination

Evaluating ChatGPT-4 in medical education: an assessment of subject exam performance reveals limitations in clinical curriculum support for students

ChatGPT in medical school: how successful is AI in progress testing?

Performance of ChatGPT on USMLE: Unlocking the Potential of Large Language Models for AI-Assisted Medical Education

Evaluating ChatGPT as a self-learning tool in medical biochemistry: A performance assessment in undergraduate medical university examination

Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models

ChatGPT Knowledge Evaluation in Basic and Clinical Medical Sciences: Multiple Choice Question Examination-Based Performance

Is ChatGPT 'ready' to be a learning tool for medical undergraduates and will it perform equally in different subjects? Comparative study of ChatGPT performance in tutorial and case-based learning questions in physiology and biochemistry

Assessing ChatGPT's potential as a clinical resource for medical oncologists: An evaluation with board-style questions and real-world patient cases.

Special Issue on Informatics Education: ChatGPT Performs Worse on USMLE-Style Ethics Questions Compared to Medical Knowledge Questions

Is ChatGPT's Knowledge and Interpretative Ability Comparable to First Professional MBBS (Bachelor of Medicine, Bachelor of Surgery) Students of India in Taking a Medical Biochemistry Examination?

The potential of ChatGPT in medicine: an example analysis of nephrology specialty exams in Poland

Evaluation of ChatGPT as a diagnostic tool for medical learners and clinicians

Sailing the Seven Seas: A Multinational Comparison of ChatGPT's Performance on Medical Licensing Examinations

Evaluating Performance of ChatGPT on MKSAP Cardiology Board Review Questions