In-depth analysis of ChatGPT's performance based on specific signaling words and phrases in the question stem of 2377 USMLE step 1 style questions

Leonard Knoedler, Samuel Knoedler, Cosima C. Hoch, Lukas Prantl, Konstantin Frank, Laura Soiderer, Sebastian Cotofana, Amir H. Dorafshar, Thilo Schenck, Felix Vollbach, Giuseppe Sofo, Michael Alfertshofer
DOI: https://doi.org/10.1038/s41598-024-63997-7
IF: 4.6
2024-06-13
Scientific Reports
Abstract: ChatGPT has garnered attention as a multifaceted AI chatbot with potential applications in medicine. Despite intriguing preliminary findings in areas such as clinical management and patient education, there remains a substantial knowledge gap in comprehensively understanding the potential and limitations of ChatGPT's capabilities, especially in medical test-taking and education. A total of n = 2,729 USMLE Step 1 practice questions were extracted from the Amboss question bank. After excluding 352 image-based questions, the remaining 2,377 text-based questions were further categorized and entered manually into ChatGPT, and its responses were recorded. ChatGPT's overall performance was analyzed based on question difficulty, category, and content with regard to specific signal words and phrases. ChatGPT achieved an overall accuracy rate of 55.8% on the n = 2,377 USMLE Step 1 preparation questions obtained from the Amboss online question bank. It demonstrated a significant inverse correlation between question difficulty and performance (r_s = -0.306; p < 0.001), maintaining comparable accuracy to the human user peer group across different levels of question difficulty. Notably, ChatGPT performed better on serology-related questions (61.1% vs. 53.8%; p = 0.005) but struggled with ECG-related content (42.9% vs. 55.6%; p = 0.021). ChatGPT performed statistically significantly worse on pathophysiology-related question stems (signal phrase = "what is the most likely/probable cause"). ChatGPT performed consistently across various question categories and difficulty levels. These findings emphasize the need for further investigations to explore the potential and limitations of ChatGPT in medical examination and education.
multidisciplinary sciences
What problem does this paper attempt to address?
The paper evaluates the performance of the AI chatbot ChatGPT on USMLE (United States Medical Licensing Examination) Step 1 style practice questions. Specifically, the researchers extracted 2,377 text-based USMLE Step 1 practice questions from the Amboss question bank and had ChatGPT answer them, then analyzed its overall accuracy, its performance at different difficulty levels, and how its performance varied with specific signal words and phrases in the question stem. The main objectives include:

1. **Evaluating ChatGPT's overall performance on USMLE Step 1 questions**: The researchers found that ChatGPT's overall accuracy was 55.8%, close to the 60% passing threshold.
2. **Analyzing performance at different difficulty levels**: ChatGPT's performance was significantly negatively correlated with question difficulty (r_s = -0.306; p < 0.001), meaning that the more difficult the question, the worse ChatGPT performed.
3. **Exploring the impact of specific signal words and phrases**: The study found that ChatGPT performed better on serology-related questions (61.1% vs. 53.8%) but worse on electrocardiogram (ECG) related questions (42.9% vs. 55.6%).
4. **Comparing performance across different medical specialties**: ChatGPT performed best in the behavioral health category (77.9%) and worst in the cardiovascular system category (44.7%).
5. **Investigating the impact of age groups on performance**: ChatGPT performed better on questions involving younger patient groups than on those involving older patient groups.

Through these analyses, the study aims to gain a deeper understanding of ChatGPT's potential and limitations in medical exams and education, providing a reference for future research.
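
To make the reported inverse correlation between question difficulty and performance more concrete, the following is a minimal sketch of how a rank correlation of this kind could be computed. It assumes the reported coefficient is Spearman's rank correlation; the source does not describe the software or tie handling used. The function names `average_ranks` and `spearman_rho` and the ten-question toy data are hypothetical illustrations, not the study's data or method.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the tie-averaged ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data (NOT from the study): difficulty level per question
# and whether the answer was correct (1) or incorrect (0).
difficulty = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
correct = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]

rho = spearman_rho(difficulty, correct)
print(f"Spearman rho on toy data: {rho:.3f}")
```

A negative value of `rho` indicates that correctness tends to fall as difficulty rises, which is the direction of the effect the paper reports. In practice a statistics library would typically be used instead, since it also provides the p-value.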