Abstract: Large language models (LLMs) such as ChatGPT can imitate human conversation and produce rapid, coherent responses, which may mask their potential for inaccuracies. With patients increasingly turning to the internet for medical information, the use of LLM chatbots for cancer-related queries risks spreading misinformation. Our study assessed ChatGPT’s accuracy and reproducibility in offering valid information and treatment advice for lung cancer in line with established guidelines.

In AI-driven healthcare support, the ability of language models to provide accurate and reliable information is crucial. We evaluated OpenAI's ChatGPT models (versions 3.5 and 4.0) on patient inquiries about lung cancer across the following domains: general information, clinical presentation, risk factors, screening, diagnosis, staging, treatment options, prognosis, post-treatment follow-up, lifestyle recommendations, and psychosocial/educational aspects. We conducted a structured assessment, posing an identical set of 47 questions to both ChatGPT 3.5 and 4.0. Each question was posed twice per model to evaluate both the accuracy and the reproducibility of the responses. The scoring system focused on the accuracy and comprehensiveness of each response.

Our findings revealed a notable disparity in the performance of the two models. GPT 4.0 was more accurate, with 41 out of 47 (87.2%) responses deemed accurate and comprehensive, compared to 36 out of 47 (76.6%) for GPT 3.5. In terms of reproducibility across repeated queries, both models performed strongly: 42 out of 47 (89.3%) for GPT 3.5 and 45 out of 47 (95.7%) for GPT 4.0. When comparing responses between the models, we observed good reproducibility in 38 out of 47 questions (80.8%).

A key observation was that GPT 4.0 significantly outperformed its predecessor, GPT 3.5, in both accuracy and reproducibility within its own responses, indicating more reliable and consistent performance. Both models were least accurate on lung cancer staging, indicating a need for further refinement in this domain. Another key observation was the models' tendency to use empathetic language, often beginning responses with expressions of sympathy and consistently advising confirmation with a medical professional. Our study underscores the potential and limitations of current AI models in patient education and support, highlighting areas for improvement and the importance of empathetic communication in AI interactions with patients. As the models continue to be trained on larger and more comprehensive data sets, it is reasonable to anticipate further improvements in their ability to provide precise, detailed, and contextually appropriate responses.

Citation Format: Asiyah Allibhai, Ahmed Allibhai, Anthony Brade, Zishan Allibhai. Evaluating the accuracy and reproducibility of ChatGPT models in answering lung cancer patient queries [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 1296.

Can the ChatGPT and other Large Language Models with internet-connected database solve the questions and concerns of patient with prostate cancer?

Performance of large language models (LLMs) in providing prostate cancer information

Assessing the performance of large language models (LLMs) in answering medical questions regarding breast cancer in the Chinese context

Large language models encode medical oncology knowledge: Performance on the ASCO and ESMO examination questions.

P717 Evaluating the performance of Large Language Models in responding to patients' health queries: A comparative analysis with medical experts

Performance of large language models on benign prostatic hyperplasia frequently asked questions

The Emerging Role of Large Language Models in Improving Prostate Cancer Literacy

Abstract 1296: Evaluating the accuracy and reproducibility of ChatGPT models in answering lung cancer patient queries

Large language model answers medical questions about standard pathology reports

The performance of large language model powered chatbots compared to oncology physicians on colorectal cancer queries

Exploring the role of artificial intelligence, large language models: Comparing patient‐focused information and clinical decision support capabilities to the gynecologic oncology guidelines

Accuracy, readability, and understandability of large language models for prostate cancer information to the public

Implementing large language model-based artificial intelligence (AI) technology in proposing effective treatment plans in patients with cancer.

Quality of Answers of Generative Large Language Models vs Peer Patients for Interpreting Lab Test Results for Lay Patients: Evaluation Study

Quality of Answers of Generative Large Language Models Versus Peer Users for Interpreting Laboratory Test Results for Lay Patients: Evaluation Study

Amplifying Chinese physicians' emphasis on patients' psychological states beyond urologic diagnoses with ChatGPT: A multi-center cross-sectional study

Leveraging Large Language Models for Improved Patient Access and Self-Management: Assessor-Blinded Comparison Between Expert- and AI-Generated Content

Do Large Language Model Chatbots perform better than established patient information resources in answering patient questions? A comparative study on melanoma

Can Large Language Models Aid Caregivers of Pediatric Cancer Patients in Information Seeking? A Cross-Sectional Investigation

Leveraging Large Language Models for Improved Patient Access and Self-Management in Oral Healthcare: an Assessor-blinded Preclinical Study (Preprint)

Utility of Large Language Models for Health Care Professionals and Patients in Navigating Hematopoietic Stem Cell Transplantation: Comparison of the Performance of ChatGPT-3.5, ChatGPT-4, and Bard