Abstract:

Purpose: Artificial intelligence (AI) has rapidly gained popularity with the growth of ChatGPT (OpenAI, San Francisco, USA) and other large language model chatbots, and these programs have tremendous potential to impact medicine. One important consequence for medicine and public health is that patients may use these programs to seek answers to medical questions. Despite increasing public use of AI chatbots, little research has assessed the reliability of ChatGPT and alternative programs when queried for medical information. This study assesses the accuracy and readability of AI chatbot answers to patient questions regarding urology. Because vasectomy is one of the most common urologic procedures, the study examines AI-generated responses to frequently asked vasectomy-related questions, using five popular, free-to-access AI platforms.

Methods: Fifteen vasectomy-related questions were individually posed to five AI chatbots between November and December 2023: ChatGPT (OpenAI, San Francisco, USA), Bard (Google Inc., Mountain View, USA), Bing (Microsoft, Redmond, USA), Perplexity (Perplexity AI Inc., San Francisco, USA), and Claude (Anthropic, San Francisco, USA). Responses from each platform were graded by two attending urologists, two urology research faculty, and one urology resident physician on a 6-point Likert scale (1 = completely inaccurate, 6 = completely accurate) based on comparison with existing American Urological Association guidelines. The Flesch-Kincaid Grade Level (FKGL) and Flesch Reading Ease score (FRES) (1-100) were calculated for each response. Differences in Likert, FRES, and FKGL scores were assessed with Kruskal-Wallis tests in GraphPad Prism V10.1.0 (GraphPad, San Diego, USA), with alpha set at 0.05.
Results: ChatGPT provided the most accurate responses of the five AI chatbots, with an average Likert score of 5.04, followed by Microsoft Bing (4.91), Anthropic Claude (4.65), Google Bard (4.43), and Perplexity (4.41). All five chatbots therefore averaged at least 4.41, corresponding to a rating of at least "somewhat accurate." Google Bard had the highest FRES (49.67) and the lowest FKGL (10.1) of the five chatbots. Anthropic Claude scored 46.7 on the FRES and 10.55 on the FKGL, Microsoft Bing scored 45.57 and 11.56, and Perplexity scored 36.4 and 13.29. ChatGPT had the lowest FRES (30.4) and the highest FKGL (14.2).

Conclusion: This study investigates the use of AI in medicine, specifically urology, and helps to determine whether large language model chatbots can be reliable sources of freely available medical information. On average, all five AI chatbots achieved at least "somewhat accurate" on a 6-point Likert scale. In terms of readability, all five had average FRES values below 50 and average FKGL values above the 10th-grade level. In this small-scale study, several significant differences were identified among the readability scores of the chatbots, but no significant differences were found among their accuracy scores. Thus, our study suggests that major AI chatbots may perform similarly in accuracy but differ in how easily the general public can comprehend their responses.
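
The two readability metrics named in the Methods can be illustrated with a minimal sketch. It applies the standard published Flesch Reading Ease and Flesch-Kincaid Grade Level formulas. The helpers `count_syllables` and `readability` are hypothetical names written for illustration, not the tool the authors used, and the vowel-group syllable heuristic only approximates true syllable counts, so its scores may differ from those reported above.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables by counting vowel groups (illustrative heuristic)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent, except in '-le' endings such as "table".
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str) -> tuple[float, float]:
    """Return (FRES, FKGL) for a passage of English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    # Flesch Reading Ease: higher scores mean easier text.
    fres = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    # Flesch-Kincaid Grade Level: approximate US school grade.
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return fres, fkgl


if __name__ == "__main__":
    sample = "A vasectomy is a minor procedure. It is done in a clinic."
    fres, fkgl = readability(sample)
    print(f"FRES={fres:.1f}, FKGL={fkgl:.1f}")
```

Both formulas depend only on average sentence length and average syllables per word, which is why a response with long sentences and polysyllabic medical terms scores a low FRES and a high FKGL at the same time.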

Quality of Information About Kidney Stones from Artificial Intelligence Chatbots

Still Using Only ChatGPT? The Comparison of Five Different Artificial Intelligence Chatbots' Answers to the Most Common Questions About Kidney Stones

AI-Driven Patient Education in Chronic Kidney Disease: Evaluating Chatbot Responses against Clinical Guidelines

Artificial intelligence-powered chatbots in search engines: a cross-sectional study on the quality and risks of drug information for patients

Assessing the Quality of Patient Education Materials on Cardiac Catheterization From Artificial Intelligence Chatbots: An Observational Cross-Sectional Study

Accuracy and Readability of Kidney Stone Patient Information Materials Generated by a Large Language Model Compared to Official Urologic Organizations

Can Patients With Urogenital Cancer Rely on Artificial Intelligence Chatbots for Treatment Decisions?

Battle of the bots: a comparative analysis of ChatGPT and bing AI for kidney stone-related questions

Using ChatGPT for Kidney Transplantation: Perceived Information Quality by Race and Education Levels

Dr. Google to Dr. ChatGPT: assessing the content and quality of artificial intelligence-generated medical information on appendicitis

Accuracy and Readability of Artificial Intelligence Chatbot Responses to Vasectomy-Related Questions: Public Beware

Artificial intelligence improves urologic oncology patient education and counseling

Quality of Chatbot Information Related to Benign Prostatic Hyperplasia

Empowering patients: how accurate and readable are large language models in renal cancer education

Assessing the Readability of Patient Education Materials on Cardiac Catheterization From Artificial Intelligence Chatbots: An Observational Cross-Sectional Study

AI Chatbot-Assisted Dietary Management of Oxalate for Kidney Stone Prevention

Evaluation of the Current Status of Artificial Intelligence for Endourology Patient Education: A Blind Comparison of ChatGPT and Google Bard against Traditional Information Resources

Artificial Intelligence Chatbots' Understanding of the Risks and Benefits of Computed Tomography and Magnetic Resonance Imaging Scenarios

Evaluating the Efficacy of ChatGPT as a Patient Education Tool in Prostate Cancer: Multimetric Assessment

Reliability of Medical Information Provided by ChatGPT: Assessment Against Clinical Guidelines and Patient Information Quality Instrument

Accuracy of Prospective Assessments of 4 Large Language Model Chatbot Responses to Patient Questions About Emergency Care: Experimental Comparative Study