Abstract:

BACKGROUND Despite growing interest in applying Large Language Models (LLMs) in the medical field, the feasibility of using an LLM as a Standardized Patient (SP) in medical assessment has rarely been evaluated. We examined whether ChatGPT, a representative LLM, could help transform medical education by serving as a cost-effective alternative to SPs, specifically for history-taking tasks.

OBJECTIVE The study aims to explore ChatGPT's viability and performance as an SP, employing prompt engineering to refine its accuracy and utility in medical assessments.

METHODS A two-phase experiment was designed to assess ChatGPT's viability as an SP in medical education. The first phase tested feasibility by simulating conversations on Inflammatory Bowel Disease (IBD), with inquiries categorized as poor, medium, or good based on relevance and accuracy. The second phase was a more structured experiment that used detailed scripts to evaluate ChatGPT's performance against specific criteria, focusing on its anthropomorphism, clinical accuracy, and adaptability. Prompts were adjusted to address shortcomings in ChatGPT's responses, and ChatGPT's performance with the original and revised prompts was compared to track improvements. The methodology included statistical analysis to ensure rigorous evaluation. Data were collected between November and December 2023.

RESULTS The feasibility test confirmed ChatGPT's ability to simulate an SP effectively, differentiating between poor, medium, and good medical inquiries with varying degrees of accuracy. Score differences between the poor (74.7, SD=5.44) and medium (82.67, SD=5.30) inquiry groups (P < .001) and between the poor and good (85, SD=3.27) inquiry groups (P < .001) were significant at a significance level of α = .05, whereas the score difference between the medium and good inquiry groups was not statistically significant (P = .158).
The feasibility test comprised 90 runs. However, performance was not ideal without proper prompt restrictions. In the subsequent performance enhancements, revised prompts instructed ChatGPT to avoid medical jargon for realism, to provide accurate and concise responses for clinical accuracy, and to follow specific instructions to improve its grading accuracy and adaptability. The second experimental phase comprised 300 trials. The revised prompt significantly improved ChatGPT's realism, clinical accuracy, and adaptability, leading to a marked reduction in scoring discrepancies. ChatGPT's score accuracy improved 4.926-fold compared with the unrevised prompt. The score difference percentage (SDP) dropped from 29.83% to 6.06%, and its standard deviation dropped from 0.55 to 0.068.

CONCLUSIONS ChatGPT, as a representative LLM, is a viable tool for simulating SPs in medical assessments, with the potential to enhance medical training. With detailed and targeted prompts, ChatGPT's scoring accuracy and response realism improve significantly, approaching a level that is feasible for actual clinical use. However, despite promising outcomes, continuous refinement is essential to fully establish the reliability of LLMs such as ChatGPT in clinical assessment settings.
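
The improvement figures in the RESULTS can be sanity-checked with simple arithmetic. The sketch below assumes that the reported 4.926-fold improvement is the ratio of the original SDP to the revised SDP; the abstract does not state how that figure was derived, so this is an interpretation rather than the authors' method. The function name `fold_reduction` and the constant names are illustrative only. With the rounded values given in the abstract, the ratio comes out at about 4.92, close to the reported figure; the small gap may be due to rounding of the published percentages.

```python
def fold_reduction(before: float, after: float) -> float:
    """Return how many times smaller `after` is than `before`."""
    if after <= 0:
        raise ValueError("`after` must be positive")
    return before / after


# Values as reported (rounded) in the abstract.
SDP_ORIGINAL = 29.83  # score difference percentage, original prompt (%)
SDP_REVISED = 6.06    # score difference percentage, revised prompt (%)
SD_ORIGINAL = 0.55    # standard deviation, original prompt
SD_REVISED = 0.068    # standard deviation, revised prompt

# Assumption: "improved N times" is read as the ratio of the two SDP values.
sdp_fold = fold_reduction(SDP_ORIGINAL, SDP_REVISED)
sd_fold = fold_reduction(SD_ORIGINAL, SD_REVISED)

# Absolute change in SDP, in percentage points.
absolute_drop = SDP_ORIGINAL - SDP_REVISED

print(f"SDP fold reduction: {sdp_fold:.3f}")
print(f"SD fold reduction:  {sd_fold:.3f}")
print(f"SDP absolute drop:  {absolute_drop:.2f} percentage points")
```

Reporting the absolute drop in percentage points alongside the fold change makes the size of the effect easier to judge, since a fold change alone hides the starting level.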

DoctorGLM: Fine-tuning your Chinese Doctor is not a Herculean Task

TCMChat: A Generative Large Language Model for Traditional Chinese Medicine

ChatDoctor: A Medical Chat Model Fine-tuned on LLaMA Model using Medical Domain Knowledge

Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-world Multi-turn Dialogue

ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge

DoctorGPT: A Large Language Model with Chinese Medical Question-Answering Capabilities

ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences

MedChatZH: a Better Medical Adviser Learns from Better Instructions

LLMs for Doctors: Leveraging Medical LLMs to Assist Doctors, Not Replace Them

Uncovering Language Disparity of ChatGPT in Healthcare: Non-English Clinical Environment for Retinal Vascular Disease Classification (Preprint)

MedChatZH: A tuning LLM for traditional Chinese medicine consultations

Development and evaluation of a large language model of ophthalmology in Chinese

A Role-specific Guided Large Language Model for Ophthalmic Consultation Based on Stylistic Differentiation

Large Language Models Leverage External Knowledge to Extend Clinical Insight Beyond Language Boundaries

Application of Large Language Models in Medical Training Evaluation: Can ChatGPT Be a Standardized Patient? an Exploratory Study (Preprint)

P717 Evaluating the performance of Large Language Models in responding to patients' health queries: A comparative analysis with medical experts

HuatuoGPT-II, One-stage Training for Medical Adaption of LLMs

Enhancing Clinical Accuracy of Medical Chatbots with Large Language Models

PediatricsGPT: Large Language Models as Chinese Medical Assistants for Pediatric Applications

Data Set and Benchmark (MedGPTEval) to Evaluate Responses From Large Language Models in Medicine: Evaluation Development and Validation

A Survey of Large Language Models in Medicine: Progress, Application, and Challenge