Abstract:

BACKGROUND Despite growing interest in applying Large Language Models (LLMs) in the medical field, the feasibility of using them as Standardized Patients (SPs) in medical assessment has rarely been evaluated. We examined the potential of ChatGPT, a representative LLM, to transform medical education by serving as a cost-effective alternative to SPs, specifically for history-taking tasks.

OBJECTIVE The study aims to explore ChatGPT's viability and performance as an SP, employing prompt engineering to refine its accuracy and utility in medical assessments.

METHODS A two-phase experiment was designed to assess ChatGPT's viability as an SP in medical education. The first phase tested feasibility by simulating conversations on Inflammatory Bowel Disease (IBD), categorizing responses into poor, medium, and good inquiry groups based on relevance and accuracy. The second phase was a more structured experiment that used detailed scripts to evaluate ChatGPT's performance against specific criteria, focusing on its anthropomorphism, clinical accuracy, and adaptability. Prompts were adjusted to address shortcomings in ChatGPT's responses, and ChatGPT's performance with the original and revised prompts was compared to track improvements. The methodology included statistical analysis to ensure rigorous evaluation, with data collected between November and December 2023.

RESULTS The feasibility test, which comprised 90 runs, confirmed ChatGPT's ability to simulate an SP effectively, differentiating between poor, medium, and good medical inquiries with varying degrees of accuracy. Score differences between the poor (74.7, SD=5.44) and medium (82.67, SD=5.30) inquiry groups (P < .001) and between the poor and good (85, SD=3.27) inquiry groups (P < .001) were significant at a significance level of α = .05, whereas the score difference between the medium and good inquiry groups was not statistically significant (P = .158). However, performance was not ideal without proper prompt restrictions. The revised prompts instructed ChatGPT to avoid medical jargon for realism, to provide accurate and concise responses for clinical accuracy, and to follow specific prompts to improve its grading accuracy and adaptability. The second experimental phase comprised 300 trials. The revised prompt significantly improved ChatGPT's realism, clinical accuracy, and adaptability, leading to a marked reduction in scoring discrepancies. ChatGPT's score accuracy improved 4.926 times compared with the unrevised prompt. The score difference percentage (SDP) dropped from 29.83% to 6.06%, and its standard deviation dropped from 0.55 to 0.068.

CONCLUSIONS ChatGPT, as a representative LLM, is a viable tool for simulating SPs in medical assessments, with the potential to enhance medical training. With detailed and targeted prompts, ChatGPT's scoring accuracy and response realism improve significantly, approaching the feasibility of actual clinical use. However, despite these promising outcomes, continuous refinement is essential to fully establish the reliability of LLMs such as ChatGPT in clinical assessment settings.
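As a rough check on the arithmetic reported above, the sketch below relates the drop in SDP to the stated improvement factor. The abstract does not define SDP, so the `score_difference_percentage` helper is a hypothetical definition (absolute difference from a reference score, as a percentage of that score), and `fold_reduction` is an illustrative name rather than anything from the study. Only the two SDP values are taken from the abstract.

```python
# SDP values reported in the abstract, in percent.
SDP_ORIGINAL_PROMPT = 29.83
SDP_REVISED_PROMPT = 6.06


def score_difference_percentage(model_score: float, reference_score: float) -> float:
    """ASSUMED definition of SDP: absolute gap between the model's score and a
    reference (e.g. human-assigned) score, as a percentage of the reference.
    The abstract does not spell this out, so treat it as illustrative only."""
    if reference_score == 0:
        raise ValueError("reference_score must be non-zero")
    return abs(model_score - reference_score) / abs(reference_score) * 100.0


def fold_reduction(before: float, after: float) -> float:
    """Return how many times smaller `after` is than `before`."""
    if after <= 0:
        raise ValueError("after must be positive")
    return before / after


ratio = fold_reduction(SDP_ORIGINAL_PROMPT, SDP_REVISED_PROMPT)
absolute_drop = round(SDP_ORIGINAL_PROMPT - SDP_REVISED_PROMPT, 2)

print(f"Fold reduction in SDP: {ratio:.2f}")
print(f"Absolute drop: {absolute_drop} percentage points")
```

From the rounded percentages the ratio comes out at about 4.92, close to the reported 4.926; the small gap would be consistent with the authors computing the factor from unrounded values, although the abstract does not say so.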

Leveraging Large Language Models for Improved Patient Access and Self-Management in Oral Healthcare: an Assessor-blinded Preclinical Study (Preprint)

Leveraging Large Language Models for Improved Patient Access and Self-Management: Assessor-Blinded Comparison Between Expert- and AI-Generated Content

Evaluation of the Performance of Generative AI Large Language Models ChatGPT, Google Bard, and Microsoft Bing Chat in Supporting Evidence-Based Dentistry: Comparative Mixed Methods Study

P717 Evaluating the performance of Large Language Models in responding to patients' health queries: A comparative analysis with medical experts

Comprehensiveness of Large Language Models in Patient Queries on Gingival and Endodontic Health

Utility of Large Language Models for Health Care Professionals and Patients in Navigating Hematopoietic Stem Cell Transplantation: Comparison of the Performance of ChatGPT-3.5, ChatGPT-4, and Bard

Evaluating the efficacy of leading large language models in the Japanese national dental hygienist examination: A comparative analysis of ChatGPT, Bard, and Bing Chat

ChatGPT-3.5, ChatGPT-4, Google Bard, and Microsoft Bing to Improve Health Literacy and Communication in Pediatric Populations and Beyond

A Survey of Clinicians’ Views of the Utility of Large Language Models

Evidence-based potential of generative artificial intelligence large language models in orthodontics: a comparative study of ChatGPT, Google Bard, and Microsoft Bing

Are Large Language Models Ready for Healthcare? A Comparative Study on Clinical Language Understanding

Quality of Answers of Generative Large Language Models Versus Peer Users for Interpreting Laboratory Test Results for Lay Patients: Evaluation Study

Quality of Answers of Generative Large Language Models vs Peer Patients for Interpreting Lab Test Results for Lay Patients: Evaluation Study

Leveraging Large Language Models in the delivery of post-operative dental care: a comparison between an embedded GPT model and ChatGPT

Clinical Accuracy, Relevance, Clarity, and Emotional Sensitivity of Large Language Models to Surgical Patient Questions: Cross-Sectional Study

Evaluation of large language model performance on the Biomedical Language Understanding and Reasoning Benchmark

Systematic analysis of ChatGPT, Google search and Llama 2 for clinical decision support tasks

Comparison of Large Language Models in Answering Immuno-Oncology Questions: A Cross-Sectional Study

Application of Large Language Models in Medical Training Evaluation: Can ChatGPT Be a Standardized Patient? an Exploratory Study (Preprint)

A Systematic Review of ChatGPT and Other Conversational Large Language Models in Healthcare

Applications and Concerns of ChatGPT and Other Conversational Large Language Models in Health Care: Systematic Review