Abstract:

Background: Although Chat Generative Pre-trained Transformer (ChatGPT) has made several successful attempts in the medical field, most notably in answering medical questions in English, no studies have evaluated ChatGPT's performance on a medical task in a Chinese context.

Objective: The aim of this study was to evaluate ChatGPT's ability to understand medical knowledge in Chinese, as well as its potential to serve as an electronic health infrastructure for medical development, by evaluating its performance in medical examinations, records, and education.

Method: The Chinese (CNMLE) and English (ENMLE) datasets of the China National Medical Licensing Examination and the Chinese dataset (NEEPM) of the China National Entrance Examination for Postgraduate Clinical Medicine Comprehensive Ability were used to evaluate the performance of ChatGPT (GPT-3.5 and GPT-4). We repeatedly assessed answer accuracy, verbal fluency, and the classification of incorrect responses attributable to hallucinations. In addition, we tested ChatGPT's performance on discharge summaries and group learning in a Chinese context on a small scale.

Results: The accuracy of GPT-3.5 on CNMLE, ENMLE, and NEEPM was 56% (56/100), 76% (76/100), and 62% (62/100), respectively, whereas GPT-4 achieved 84% (84/100), 86% (86/100), and 82% (82/100). The verbal fluency of all ChatGPT responses exceeded 95%. Among the GPT-3.5 incorrect responses, open-domain hallucinations accounted for 66% (29/44), 54% (14/24), and 63% (24/38), whereas close-domain hallucinations accounted for 34% (15/44), 46% (14/24), and 37% (14/38), respectively. By contrast, GPT-4 open-domain hallucinations accounted for 56% (9/16), 43% (6/14), and 83% (15/18), while close-domain hallucinations accounted for 44% (7/16), 57% (8/14), and 17% (3/18), respectively. In the discharge summary task, ChatGPT demonstrated logical coherence; however, GPT-3.5 did not meet the quality requirements, whereas GPT-4 achieved a qualification rate of 60% (6/10). In group learning, the verbal fluency of ChatGPT and the interaction satisfaction with it were both 100% (10/10).

Conclusion: ChatGPT based on GPT-4 is on par with Chinese medical practitioners who passed the CNMLE and meets the standard required for admission to clinical medical graduate programs in China. GPT-4 shows promising potential for discharge summarization and group learning. Additionally, it shows high verbal fluency, resulting in a positive human–computer interaction experience. Compared with the earlier GPT-3.5 model, GPT-4 significantly improves multiple capabilities and reduces hallucinations, with a particular leap forward in Chinese comprehension for medical tasks. Artificial intelligence (AI) systems still face the challenges of hallucinations, legal risks, and ethical issues. Nevertheless, we found that ChatGPT has the potential to promote medical development as an electronic health infrastructure, paving the way for medical AI to become a necessity.

Multimodal ChatGPT for Medical Applications: an Experimental Study of GPT-4V

Capability of GPT-4V(ision) in Japanese National Medical Licensing Examination

A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis

A Comprehensive Study of GPT-4V's Multimodal Capabilities in Medical Imaging

Can GPT-4V(ision) Serve Medical Applications? Case Studies on GPT-4V for Multimodal Medical Diagnosis

Step into the era of large multimodal models: a pilot study on ChatGPT-4V(ision)'s ability to interpret radiological images

Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine

Performance of Multimodal GPT-4V on USMLE with Image: Potential for Imaging Diagnostic Support with Explanations

GPT-4 Vision: Multi-Modal Evolution of ChatGPT and Potential Role in Radiology

Evaluating General Vision-Language Models for Clinical Medicine

Performance and exploration of ChatGPT in medical examination, records and education in Chinese: Pave the way for medical AI

Unveiling the Clinical Incapabilities: A Benchmarking Study of GPT-4V(ision) for Ophthalmic Multimodal Image Analysis

A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual Question Answering

GPT-4V(ision) Unsuitable for Clinical Care and Education: A Clinician-Evaluated Assessment

Keeping Up With ChatGPT

Diagnostic Accuracy of GPT Multimodal Analysis on USMLE Questions Including Text and Visuals

Application, investigation and prediction of ChatGPT/GPT-4 for clinical cases in medical field

Holistic Evaluation of GPT-4V for Biomedical Imaging

Enhancing Medical Task Performance in GPT-4V: A Comprehensive Study on Prompt Engineering Strategies

Comparative Analysis of GPT-4Vision, GPT-4 and Open Source LLMs in Clinical Diagnostic Accuracy: A Benchmark Against Human Expertise