P717 Evaluating the performance of Large Language Models in responding to patients' health queries: A comparative analysis with medical experts

Z Yan, S Lu, D Xu, Y Yang, H Wang, J Mao, Y Fan, Y Chen, H C Tseng

DOI: https://doi.org/10.1093/ecco-jcc/jjad212.0847

2024-01-01

Journal of Crohn's and Colitis

Abstract:

Background: Patients with chronic diseases exhibit a heightened interest in seeking health information, and access to high-quality information can positively impact clinical outcomes. While previous research on static internet text/video information has highlighted concerns about low-barrier creation leading to low-quality content, it remains uncertain whether similar issues persist in responses generated by Large Language Models (LLMs). Assessing the ability of LLMs to respond to medical queries provides valuable insights for their application in healthcare settings.

Methods: In alignment with open science principles, we utilized real patient queries from the China Crohn's and Colitis Foundation (CCCF) series "Questions and Answers on Ulcerative Colitis and Crohn's Disease." The dataset comprised questions posed by patients and corresponding answers from medical professionals, collected from outpatient visits and online social media. In September 2023, 263 patient questions were sequentially input into ChatGPT-3.5 (August 3, 2023 version), and the resulting responses were compiled alongside the original medical professional responses, forming 263 modules. Three Inflammatory Bowel Disease (IBD) specialist physicians and three IBD patients were invited to assess each module. Evaluators were instructed to: 1) choose their preferred response version, and 2) provide a multidimensional 5-point Likert subjective assessment using a crowdsourcing strategy. Additionally, the CRIE 3.0 team conducted an automated objective analysis of Simplified Chinese readability.

Results: Mann-Whitney U tests on text readability levels (median: 7th grade for both medical professionals and ChatGPT responses; Q1: 6th grade; Q3: 8th grade) revealed no significant difference (p = 0.87), suggesting that the readability of ChatGPT's responses aligns well with the literacy level recommended for popular science publications and is comparable to the average education level in China.

Conclusion: Interpreted cautiously, our findings suggest that ChatGPT's preliminary performance is comparable to that of specialized IBD physicians, indicating its potential utility in patient community Q&A. Integrating ChatGPT or similar LLMs into the drafting or refinement stages of health texts is feasible. However, given the presence of AI hallucinations, and consistent with the consensus of most experimental studies, direct use of large language models for patient Q&A services is not recommended. Recognizing the variability in health information understanding between medical professionals and patients can enhance patient education efforts.

gastroenterology & hepatology

What problem does this paper attempt to address?

This paper evaluates how well large language models (LLMs) respond to patients' health inquiries and compares their responses with those of medical experts. As background, the authors note that patients with chronic diseases show a strong interest in seeking health information, and that high-quality information can positively affect clinical outcomes. Previous studies of static internet text and video information have shown that a low barrier to creation may lead to low-quality content, but it is still unclear whether similar problems exist in responses generated by LLMs. Evaluating the ability of LLMs to respond to medical queries therefore provides valuable insights for their application in healthcare settings. To answer this question, the researchers used a dataset of real patient queries from the China Crohn's and Colitis Foundation (CCCF) series "Questions and Answers on Ulcerative Colitis and Crohn's Disease," which includes questions asked by patients and the corresponding answers provided by medical professionals. The 263 questions were input into ChatGPT-3.5 (August 3, 2023 version), and each generated response was paired with the original medical professional's response, forming 263 modules. Three inflammatory bowel disease (IBD) specialists and three IBD patients were then invited to evaluate each module. Evaluators chose their preferred response version and gave a multidimensional subjective assessment on a 5-point Likert scale. In addition, the CRIE 3.0 team conducted an automated objective analysis of Simplified Chinese readability.

The results show that the readability level of the responses generated by ChatGPT is comparable to that of the medical professionals' responses, with no significant difference (p = 0.87), indicating that ChatGPT's responses meet the literacy level recommended for popular science publications and match the average education level in China. The conclusion points out that, although the preliminary results suggest ChatGPT's performance is comparable to that of IBD specialists and indicate potential use in patient community Q&A, directly using large language models for patient Q&A services is not recommended, given the presence of AI hallucinations and in line with the conclusions of most experimental studies. At the same time, recognizing the differences in how medical professionals and patients understand health information can enhance patient education.
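The readability comparison above relies on a Mann-Whitney U test applied to grade levels. The following is a minimal sketch of such a test, assuming a two-sided normal approximation with tie correction and no continuity correction. The abstract does not say which software or settings the authors used, so this is an illustration and not their procedure. The function name `mann_whitney_u` and the grade lists are hypothetical and are not data from the study.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U, p) where U is the smaller of the two U statistics.
    No continuity correction is applied.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both samples, remembering which group each value came from.
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y],
                      key=lambda t: t[0])
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        # Tied values share the average of the 1-based ranks i+1 .. j.
        avg_rank = (i + 1 + j) / 2
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, combined) if g == 0)
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    u = min(u1, u2)
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0  # all values identical: no evidence of a difference
    z = (u - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return u, min(p, 1.0)


# Hypothetical readability grade levels (NOT the study's data).
physician_grades = [6, 7, 7, 8, 6, 7, 8, 7]
chatgpt_grades = [7, 6, 8, 7, 7, 6, 8, 7]

u_stat, p_value = mann_whitney_u(physician_grades, chatgpt_grades)
print(f"U = {u_stat}, p = {p_value:.2f}")
```

A rank-based test suits this comparison because grade levels are ordinal and heavily tied, so a comparison of means would assume more than the data support. A large p-value only indicates that no difference was detected. It does not by itself establish that the two sets of responses are equivalent.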

Assessing the performance of large language models (LLMs) in answering medical questions regarding breast cancer in the Chinese context

The performance of large language model powered chatbots compared to oncology physicians on colorectal cancer queries

Quality of Answers of Generative Large Language Models Versus Peer Users for Interpreting Laboratory Test Results for Lay Patients: Evaluation Study

Quality of Answers of Generative Large Language Models vs Peer Patients for Interpreting Lab Test Results for Lay Patients: Evaluation Study

Leveraging Large Language Models for Improved Patient Access and Self-Management: Assessor-Blinded Comparison Between Expert- and AI-Generated Content

Physician Versus Large Language Model Chatbot Responses to Web-Based Questions From Autistic Patients in Chinese: Cross-Sectional Comparative Analysis

Large language model answers medical questions about standard pathology reports

Performance of Large Language Models in Patient Complaint Resolution: Web-Based Cross-Sectional Survey

Evaluation of ChatGPT-generated medical responses: A systematic review and meta-analysis

ChatGPT vs Medical Professional: Analyzing Responses to Laboratory Medicine Questions on Social Media

Assessing ChatGPT as a Medical Consultation Assistant for Chronic Hepatitis B: Cross-Language Study of English and Chinese

Comprehensiveness of Large Language Models in Patient Queries on Gingival and Endodontic Health

Performance of ChatGPT on the Chinese Postgraduate Examination for Clinical Medicine: Survey Study

Utility of Large Language Models for Health Care Professionals and Patients in Navigating Hematopoietic Stem Cell Transplantation: Comparison of the Performance of ChatGPT-3.5, ChatGPT-4, and Bard

Large language models encode medical oncology knowledge: Performance on the ASCO and ESMO examination questions.

Evaluating the Performance of Different Large Language Models on Health Consultation and Patient Education in Urolithiasis

Comparison of Large Language Models in Answering Immuno-Oncology Questions: A Cross-Sectional Study

Clinical application potential of large language model: a study based on thyroid nodules

Performance Evaluation of Lightweight Open-source Large Language Models in Pediatric Consultations: A Comparative Analysis

Data Set and Benchmark (MedGPTEval) to Evaluate Responses From Large Language Models in Medicine: Evaluation Development and Validation