Abstract

Background: Patients with chronic diseases show a heightened interest in seeking health information, and access to high-quality information can positively affect clinical outcomes. Previous research on static internet text and video information has raised concerns that the low barrier to content creation leads to low-quality content, but it remains uncertain whether similar problems persist in responses generated by Large Language Models (LLMs). Assessing how well LLMs respond to medical queries provides valuable insight for their application in healthcare settings.

Methods: In alignment with open science principles, we used real patient queries from the China Crohn's and Colitis Foundation (CCCF) series "Questions and Answers on Ulcerative Colitis and Crohn's Disease." The dataset comprised questions posed by patients and the corresponding answers from medical professionals, collected from outpatient visits and online social media. In September 2023, 263 patient questions were sequentially input into ChatGPT-3.5 (August 3, 2023 version), and the resulting responses were compiled alongside the original medical professional responses, forming 263 modules. Three Inflammatory Bowel Disease (IBD) specialist physicians and three IBD patients were invited to assess each module. Evaluators were instructed to: 1) choose their preferred response version, and 2) provide a multidimensional 5-point Likert subjective assessment using a crowdsourcing strategy. Additionally, the CRIE 3.0 team conducted an automated objective analysis of Simplified Chinese readability.

Results: Mann-Whitney U tests on text readability levels (median: 7th grade for both medical professional and ChatGPT responses; Q1: 6th grade; Q3: 8th grade) revealed no significant difference (p=0.87), suggesting that the readability of ChatGPT's responses aligns well with the literacy level recommended for popular science publications and is comparable to the average education level in China.

Conclusion: Interpreted cautiously, our findings suggest that ChatGPT's preliminary performance is comparable to that of specialized IBD physicians, indicating its potential utility in patient community Q&A. Integrating ChatGPT or similar LLMs into the drafting or refinement stages of health texts is feasible. However, because of AI hallucinations, and in line with the consensus of most experimental studies, direct use of large language models for patient Q&A services is not recommended. Recognizing the differences in how medical professionals and patients understand health information can enhance patient education efforts.
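
The readability comparison reported in the Results relies on a Mann-Whitney U test. As a rough illustration of how such a test works, and not the authors' actual analysis code, the sketch below computes the U statistic and a two-sided p-value using the normal approximation with a tie correction. The function name `mann_whitney_u` and the grade-level values are hypothetical and made up for illustration; they are not data from the study. No continuity correction is applied, so results may differ slightly from those of statistical packages that use one.

```python
import math
from collections import Counter


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Hypothetical helper for illustration only. Returns (U, p_value),
    where U is the smaller of the two U statistics.
    """
    n1, n2 = len(x), len(y)
    combined = sorted(list(x) + list(y))

    # Assign each distinct value the average of the ranks it occupies.
    ranks = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        ranks[combined[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    rank_sum_x = sum(ranks[v] for v in x)
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    u = min(u1, u2)

    # Variance of U under the null hypothesis, corrected for ties.
    n = n1 + n2
    tie_term = sum(t ** 3 - t for t in Counter(combined).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u, 1.0  # all values identical: no evidence of a difference

    z = (u - n1 * n2 / 2) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return u, p_value


# Made-up readability grade levels, NOT the study's data.
physician_grades = [6, 7, 7, 8, 6, 7, 8, 7]
chatgpt_grades = [7, 6, 7, 8, 7, 6, 8, 7]

u_stat, p = mann_whitney_u(physician_grades, chatgpt_grades)
print(f"U = {u_stat}, p = {p:.3f}")
```

A rank-based test of this kind suits ordinal outcomes such as grade levels, since it compares the ordering of values and does not assume normally distributed data.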

Evaluating the Performance of Different Large Language Models on Health Consultation and Patient Education in Urolithiasis

Evaluation of large language models in breast cancer clinical scenarios: A comparative analysis based on ChatGPT-3.5, ChatGPT-4.0, and Claude2

P717 Evaluating the performance of Large Language Models in responding to patients' health queries: A comparative analysis with medical experts

Assessing the performance of large language models (LLMs) in answering medical questions regarding breast cancer in the Chinese context

Performance of large language models on benign prostatic hyperplasia frequently asked questions

Large Language Models in Pathology: A Comparative Study on Multiple Choice Question Performance with Pathology Trainees

Clinical Accuracy, Relevance, Clarity, and Emotional Sensitivity of Large Language Models to Surgical Patient Questions: Cross-Sectional Study

Evaluating multiple large language models in pediatric ophthalmology

Evaluating Large Language Models in Ophthalmology

Leveraging Large Language Models for Improved Patient Access and Self-Management: Assessor-Blinded Comparison Between Expert- and AI-Generated Content

Comprehensiveness of Large Language Models in Patient Queries on Gingival and Endodontic Health

The performance of large language model powered chatbots compared to oncology physicians on colorectal cancer queries

Comparative Analysis of Performance of Large Language Models in Urogynecology

[Efficiency of different large language models in China in response to consultations about PCa-related perioperative nursing and health education]

Large language models in pathology: A comparative study of ChatGPT and bard with pathology trainees on multiple-choice questions

Comparing the Performance of Popular Large Language Models on the National Board of Medical Examiners Sample Questions

Performance of Large Language Models (ChatGPT, Bing Search, and Google Bard) in Solving Case Vignettes in Physiology

Large Language Models in Hematology Case Solving: A Comparative Study of ChatGPT-3.5, Google Bard, and Microsoft Bing

Clinical application potential of large language model: a study based on thyroid nodules

Quality of Answers of Generative Large Language Models vs Peer Patients for Interpreting Lab Test Results for Lay Patients: Evaluation Study

A Survey of Large Language Models in Medicine: Progress, Application, and Challenge