Abstract:Background Although chat generative pre-trained transformer ( ChatGPT) has made several successful attempts in the medical field, most notably in answering medical questions in English, no studies have evaluated ChatGPT's performance in a Chinese context for a medical task. Objective The aim of this study was to evaluate ChatGPT's ability to understand medical knowledge in Chinese, as well as its potential to serve as an electronic health infrastructure for medical development, by evaluating its performance in medical examinations, records, and education. Method The Chinese (CNMLE) and English (ENMLE) datasets of the China National Medical Licensing Examination and the Chinese dataset (NEEPM) of the China National Entrance Examination for Postgraduate Clinical Medicine Comprehensive Ability were used to evaluate the performance of ChatGPT (GPT-3.5 and GPT-4). We assessed answer accuracy, verbal fluency, and the classification of incorrect responses owing to hallucinations on multiple occasions. In addition, we tested ChatGPT's performance on discharge summaries and group learning in a Chinese context on a small scale. Results The accuracy of GPT-3.5 in CNMLE, ENMLE, and NEEPM was 56% (56/100), 76% (76/100), and 62% (62/100), respectively, compared to that of GPT-4, which was of 84% (84/100), 86% (86/100), and 82% (82/100). The verbal fluency of all the ChatGPT responses exceeded 95%. Among the GPT-3.5 incorrect responses, the proportions of open-domain hallucinations were 66 % (29/44), 54 % (14/24), and 63 % (24/38), whereas close-domain hallucinations accounted for 34 % (15/44), 46 % (14/24), and 37 % (14/38), respectively. By contrast, GPT-4 open-domain hallucinations accounted for 56% (9/16), 43% (6/14), and 83% (15/18), while close-domain hallucinations accounted for 44% (7/16), 57% (8/14), and 17% (3/18), respectively. In the discharge summary, ChatGPT demonstrated logical coherence, however GPT-3.5 could not fulfill the quality requirements, while GPT-4 met the qualification of 60% (6/10). In group learning, the verbal fluency and interaction satisfaction with ChatGPT were 100% (10/10). Conclusion ChatGPT based on GPT-4 is at par with Chinese medical practitioners who passed the CNMLE and at the standard required for admission to clinical medical graduate programs in China. The GPT-4 shows promising potential for discharge summarization and group learning. Additionally, it shows high verbal fluency, resulting in a positive human–computer interaction experience. GPT-4 significantly improves multiple capabilities and reduces hallucinations compared to the previous GPT-3.5 model, with a particular leap forward in the Chinese comprehension capability of medical tasks. Artificial intelligence (AI) systems face the challenges of hallucinations, legal risks, and ethical issues. However, we discovered ChatGPT's potential to promote medical development as an electronic health infrastructure, paving the way for Medical AI to become necessary.

Comparison of the problem-solving performance of ChatGPT-3.5, ChatGPT-4, Bing Chat, and Bard for the Korean emergency medicine board examination question bank

Human versus Artificial Intelligence: ChatGPT-4 Outperforming Bing, Bard, ChatGPT-3.5, and Humans in Clinical Chemistry Multiple-Choice Questions

Performance of ChatGPT in the In-Training Examination for Anesthesiology and Pain Medicine Residents in South Korea: Observational Study

Comparison of artificial intelligence large language model chatbots in answering frequently asked questions in anaesthesia

A Comparative Analysis of ChatGPT and Google’s AI’s “Bard” in Medicine

Performance of ChatGPT on Solving Orthopedic Board-Style Questions: A Comparative Analysis of ChatGPT 3.5 and ChatGPT 4

Assessment of ChatGPT-4 in Family Medicine Board Examinations Using Advanced AI Learning and Analytical Methods: Observational Study

Investigating the impact of innovative AI chatbot on post‐pandemic medical education and clinical assistance: a comprehensive analysis

Analysing the Applicability of ChatGPT, Bard, and Bing to Generate Reasoning-Based Multiple-Choice Questions in Medical Physiology

How does ChatGPT-4 preform on non-English national medical licensing examination? An evaluation in Chinese language

Performance of artificial intelligence chatbots in sleep medicine certification board exams: ChatGPT versus Google Bard

Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models

Application, investigation and prediction of ChatGpt/GPT-4 for clinical cases in medical field

Performance of ChatGPT-3.5 and GPT-4 in national licensing examinations for medicine, pharmacy, dentistry, and nursing: a systematic review and meta-analysis

Performance and exploration of ChatGPT in medical examination, records and education in Chinese: Pave the way for medical AI

A Clinical Evaluation of Cardiovascular Emergencies: A Comparison of Responses from ChatGPT, Emergency Physicians, and Cardiologists

Artificial Intelligence for Anesthesiology Board–Style Examination Questions: Role of Large Language Models

ChatGPT in Iranian medical licensing examination: evaluating the diagnostic accuracy and decision-making capabilities of an AI-based model

Evaluating Chatgpt As an Adjunct for Analyzing Challenging Case

S2150 Language AI Bots as Pioneers in Diagnosis? A Retrospective Analysis on Real-time Patients at a Tertiary Care Center

Artificial Intelligence Versus Medical Students in General Surgery Exam