Comparative analysis of ChatGPT-4.0's management of six gastrointestinal cancers according to the NCCN guidelines.
Tamir E. Bresler,Tyler Wilson,Zin Htway,Samuel Slomowitz,Manabu Fujita
DOI: https://doi.org/10.1200/jco.2024.42.16_suppl.e13654
IF: 45.3
2024-05-31
Journal of Clinical Oncology
Abstract: e13654 Background: The integration of Natural Language Processing (NLP) into healthcare holds tremendous promise. ChatGPT-4.0 (OpenAI, San Francisco, CA) is a widely recognized large language model that can comprehend and generate text, answer questions, and perform other language-related tasks. However, pitfalls and errors have been described in its clinical application. We explored the ability of ChatGPT-4.0 (ChatGPT) to guide clinical decision-making in 6 gastrointestinal cancers using the National Comprehensive Cancer Network (NCCN) Clinical Practice Guidelines as a framework. Methods: We reviewed the NCCN Guidelines for Anal Squamous Cell Carcinoma (AN), Small Bowel Adenocarcinoma (SB), Ampullary Adenocarcinoma (AA), Biliary Tract Cancers (BT), Pancreatic Adenocarcinoma (PN), and Gastric Cancer (GA). Up to 2 clinical questions were designed for each decision-making page. Questions were categorized by type (Workup, Treatment, Surveillance, Diagnostics, or References). ChatGPT was queried in a reproducible fashion. To account for variable prompt engineering of our non-validated assessment tool, up to 3 follow-up questions were allowed. Responses were rated on a Likert scale: 5) Correct; 4) Correct, with missing information requiring clarification; 3) Correct, but unable to complete answer; 2) Partially incorrect; 1) Absolutely incorrect. Subgroup analysis was conducted on Correctness (defined as scores 1-2 vs 3-5) and Accuracy (scores 1-3 vs 4-5). Variance in ChatGPT responses across cancers was analyzed. Descriptive statistics were used, and significance was tested with binary logistic regression. Results: A total of 270 questions were generated (range per cancer 32-68). The score frequency distribution was: 5) 45.2%; 4) 19.3%; 3) 13.3%; 2) 13.7%; and 1) 8.5%. On subgroup analysis, Correctness was seen in 210 (77.8%) questions, and Accuracy in 174 (64.4%).
The difference in Correctness scores between cancers was not statistically significant, and there was no statistically significant difference in scores by question type. There was a statistically significant difference in the Accuracy of ChatGPT between cancers (Table). Conclusions: ChatGPT was significantly more likely to provide accurate responses to questions regarding GA and PN versus AN or SB. It demonstrates a limited capacity to assist with complex clinical decision-making in 6 gastrointestinal cancers. However, the Accuracy level is below the acceptable threshold for implementation into clinical use. Further analysis of the expanding capabilities of ChatGPT and other NLP-based tools is warranted in this rapidly evolving domain. Future studies would benefit from a validated grading instrument. [Table: see text]
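The Correctness and Accuracy figures follow arithmetically from the reported score distribution, and the dichotomization can be sketched as below. This is a minimal illustration, not the authors' analysis code: the per-score counts are an assumption, reconstructed by rounding the reported percentages of the 270 questions, and the names `REPORTED_PCT` and `proportion_at_or_above` are hypothetical.

```python
# Total number of questions and the score distribution reported in the abstract.
TOTAL = 270
REPORTED_PCT = {5: 45.2, 4: 19.3, 3: 13.3, 2: 13.7, 1: 8.5}

# ASSUMPTION: raw counts are not given in the abstract; they are inferred
# here by rounding each reported percentage of the 270 questions.
counts = {score: round(TOTAL * pct / 100) for score, pct in REPORTED_PCT.items()}


def proportion_at_or_above(counts, threshold):
    """Return (n, fraction) of responses scored at or above `threshold`."""
    hits = sum(n for score, n in counts.items() if score >= threshold)
    return hits, hits / sum(counts.values())


# Correctness: scores 3-5 versus 1-2. Accuracy: scores 4-5 versus 1-3.
correct_n, correct_p = proportion_at_or_above(counts, 3)
accurate_n, accurate_p = proportion_at_or_above(counts, 4)

print(f"Correctness: {correct_n} ({correct_p:.1%})")  # Correctness: 210 (77.8%)
print(f"Accuracy: {accurate_n} ({accurate_p:.1%})")   # Accuracy: 174 (64.4%)
```

Under this reconstruction the inferred counts sum to 270 and reproduce the reported 210 and 174, which suggests the percentages and subgroup totals in the abstract are internally consistent.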
oncology