Abstract

Introduction: Applications of large language models (LLMs) in medicine are increasing. The chatbots ChatGPT and GPT4 have been tested against medical exams. These chatbots need to be vetted as a source of information for both patients and physicians. A credible source of medical information may be used as the gold standard against which an LLM can be tested.

Objective: We assess the performance of ChatGPT and GPT4 in answering the American Urological Association (AUA) Self-Assessment Study Program (SASP) questions on male sexual dysfunction (MSD), female sexual dysfunction (FSD), sexually transmitted infection (STI), and male factor infertility (MFI). We aim to find out how credible these LLMs are as a source of medical advice.

Methods: Four registered users of the SASP identified the questions using open book mode in tests from 2019 to 2023, spanning their subscriptions. The questions were ranked for difficulty on a five-point Likert scale. OpenAI ChatGPT 3.5 and GPT4 were used to answer the questions. Chat history was turned off, and no plug-ins or feedback were permitted. Prompts were generated from the question stems by masking the question ID and adding the phrase "I am a urologist preparing for my board exam. Please answer the following question:" Images were deleted from the questions and replaced with a brief description. All questions were answered separately, first using GPT4 and then ChatGPT. Three consecutive responses were generated for each prompt, and the consensus answer was tallied. The chatbots' answers were compared with the answer key provided by the SASP. Descriptive statistics, Pearson's chi-square test, and Fisher's exact test were applied.

Results: We identified 115 questions in the domains of sexual dysfunction, STI, and MFI. Only one question had an associated image. GPT4 performed better than ChatGPT in all domains, answering 60% of questions correctly versus 40% (p<0.001). Of the correct answers, 89.9% were regenerated unanimously across all three responses by GPT4 (p=0.007), compared with 67.4% by ChatGPT (p=0.244). Within each domain, there was no significant difference in correct answers for either chatbot (Table 1) or by test year (p=0.682). The references cited for the SASP answers were Campbell's Urology only in 38.3%, AUA guidelines only in 17.4%, AUA core curriculum only in 10.4%, and combinations with other sources in 33.9%. There was no significant association between the source reference and the correct answers of GPT4 (p=0.058) or ChatGPT (p=0.451). Only 19.1% of the sources were open access, 20% were partially open access, and 60.9% were restricted to subscribers. The availability of the source did not significantly affect the correct answers of GPT4 (p=0.272) or ChatGPT (p=0.231). For both chatbots, correct answers were significantly associated with easier questions (Table 2).

Conclusions: The LLMs tested have average accuracy as a source of credible medical information on sexual dysfunction, STI, and MFI. GPT4 performs better than ChatGPT, especially when the regenerated responses are unanimous. Developing a better model to serve a broader group of physicians and patients will require training the chatbot on credible urology literature, including the roughly 81% of sources that are currently fully or partially restricted.

Disclosure: No.
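
The prompt construction and three-response tally described in the Methods can be illustrated with a minimal sketch. It is not the authors' code: the function names, the placement of the phrase before the question stem, the normalization of answers to upper-case letters, and the rule that a consensus requires a strict majority (with no consensus when all three responses differ) are assumptions made for illustration.

```python
from collections import Counter
from typing import Optional

# Phrase quoted in the Methods; placing it before the stem is an assumption.
PROMPT_PREFIX = (
    "I am a urologist preparing for my board exam. "
    "Please answer the following question:"
)


def build_prompt(question_stem: str) -> str:
    """Combine the fixed phrase with a question stem (hypothetical helper)."""
    return f"{PROMPT_PREFIX}\n\n{question_stem.strip()}"


def consensus_answer(responses: list[str]) -> tuple[Optional[str], bool]:
    """Return (consensus choice, whether all responses agreed).

    Assumption: the consensus is the choice given by a strict majority
    of the responses; if no choice has a majority, there is no consensus.
    """
    if not responses:
        raise ValueError("at least one response is required")
    # Normalize answers such as " c " and "C" to the same label.
    counts = Counter(r.strip().upper() for r in responses)
    choice, votes = counts.most_common(1)[0]
    if votes <= len(responses) / 2:
        return None, False
    return choice, votes == len(responses)
```

For example, `consensus_answer(["A", "A", "B"])` yields a consensus of `"A"` that is not unanimous, while three identical responses yield a unanimous consensus. The consensus choice would then be compared with the SASP answer key.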

Performance of ChatGPT-3.5 and ChatGPT-4 on the European Board of Urology (EBU) exams: a comparative analysis

Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis

Performance of Progressive Generations of GPT on an Exam Designed for Certifying Physicians as Certified Clinical Densitometrists

Evaluating the performance of ChatGPT-4 on the United Kingdom Medical Licensing Assessment

Evaluating the performance of ChatGPT-3.5 and ChatGPT-4 on the Taiwan plastic surgery board examination

Evaluating the Efficacy of AI Chatbots as Tutors in Urology: A Comparative Analysis of Responses to the 2022 In-Service Assessment of the European Board of Urology

Evaluating the competency of ChatGPT in MRCP Part 1 and a systematic literature review of its capabilities in postgraduate medical assessments

Performance of artificial intelligence on a simulated Canadian urology board exam

Comparing the Performance of ChatGPT and GPT-4 versus a Cohort of Medical Students on an Official University of Toronto Undergraduate Medical Education Progress Test

Can ChatGPT pass the MRCP (UK) written examinations? Analysis of performance and errors using a clinical decision-reasoning framework

Evaluating the limits of AI in medical specialisation: ChatGPT’s performance on the UK Neurology Specialty Certificate Examination

Evaluating ChatGPT-4 in medical education: an assessment of subject exam performance reveals limitations in clinical curriculum support for students

Influence of Model Evolution and System Roles on ChatGPT's Performance in Chinese Medical Licensing Exams: Comparative Study

(033) Artificial Intelligence ChatGPT and GPT4 Performance on Male and Female Sexual Dysfunction, Sexually Transmitted Infection, and Male Factor Infertility in the 2019 to 2023 American Urological Association Self-Assessment Study Programs

Comprehensive analysis of the performance of GPT-3.5 and GPT-4 on the American Urological Association self-assessment study program exams from 2012-2023

How Well Does ChatGPT Do When Taking the Medical Licensing Exams? The Implications of Large Language Models for Medical Education and Knowledge Assessment

Assessing ChatGPT 4.0's test performance and clinical diagnostic accuracy on USMLE STEP 2 CK and clinical case reports

Performance of ChatGPT on American Board of Surgery In-Training Examination Preparation Questions

Evaluating ChatGPT Performance on the Orthopaedic In-Training Examination

ChatGPT-4 Surpasses Residents: A Study of Artificial Intelligence Competency in Plastic Surgery In-service Examinations and Its Advancements from ChatGPT-3.5

Performance of ChatGPT-3.5 and GPT-4 in national licensing examinations for medicine, pharmacy, dentistry, and nursing: a systematic review and meta-analysis