ChatGPT takes the FCPS exam in Internal Medicine

Hina Qazi, Syed Ahsan, Muhammad Irfan, M. A. Rehman Siddiqui
DOI: https://doi.org/10.1101/2024.06.11.24308808
2024-06-12
Abstract: Large language models (LLMs) have exhibited remarkable proficiency in clinical knowledge, encompassing diagnostic medicine, and have been tested on questions related to medical licensing examinations. ChatGPT has recently gained popularity because of its ability to generate human-like responses when presented with exam questions. It has been tested on multiple undergraduate and subspecialty exams, and the results have been mixed. We aim to test ChatGPT on questions mirroring the standards of the FCPS exam, the highest medical qualification in Pakistan. We used 111 randomly chosen FCPS-level internal medicine MCQs, submitted as text prompts three times, on 3 consecutive days. The average of the three answers was taken as the final response. The responses were recorded and compared to the answers given by subject experts. Agreement between the two was assessed using the Chi-square test and Cohen’s Kappa, with a Kappa of 0.75 taken as acceptable agreement. Univariate regression analysis was done to assess the effect of subspecialty, word count, and case scenarios on the success of ChatGPT. After risk stratification, chi-square and Kappa statistics were applied. ChatGPT 4.0 scored 73% (69%-74%). Although close to the passing criteria, it could not clear the FCPS exam. Question characteristics and subspecialty had no statistically significant effect on ChatGPT's responses. ChatGPT shows a high concordance between its responses, indicating sound knowledge and high reliability. This study's findings underline the need for caution against over-reliance on AI for critical clinical decisions without human oversight. Creating specialized models tailored for medical education could provide a viable solution to this problem.
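As an illustration of the agreement measure named in the abstract, the following is a minimal sketch of Cohen's Kappa between two sets of MCQ answers. It is not the authors' analysis code: the function name `cohens_kappa`, the variable names, and the eight-question answer lists are hypothetical and chosen only to show the arithmetic. The 0.75 threshold is the one stated in the abstract.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters need the same non-zero number of answers")

    n = len(rater_a)

    # Observed agreement: fraction of questions where both gave the same option.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: sum over options of the product of marginal proportions.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical option throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical answers to eight MCQs (illustrative only, not study data).
model_answers = ["A", "B", "C", "D", "A", "B", "C", "D"]
expert_answers = ["A", "B", "C", "D", "A", "B", "D", "C"]

kappa = cohens_kappa(model_answers, expert_answers)
ACCEPTABLE_KAPPA = 0.75  # threshold stated in the abstract
meets_threshold = kappa >= ACCEPTABLE_KAPPA
print(f"kappa = {kappa:.3f}, acceptable agreement: {meets_threshold}")
```

In this toy case the raw agreement is 6 of 8 questions (75%), yet Kappa is lower because it discounts the agreement expected by chance, which is why a percentage score and a Kappa value are not interchangeable.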
Medical Education