Hao Zhang, Neil Jethani, Simon Jones, Nicholas Genes, Vincent J Major, Ian S Jaffe, Anthony B Cardillo, Noah Heilenbach, Nadia Fazal Ali, Luke J Bonanni, Andrew J Clayburn, Zain Khera, Erica C Sadler, Jaideep Prasad, Jamie Schlacter, Kevin Liu, Benjamin Silva, Sophie Montgomery, Eric J Kim, Jacob Lester, Theodore M Hill, Alba Avoricani, Ethan Chervonski, James Davydov, William Small, Eesha Chakravartty, Himanshu Grover, John A Dodson, Abraham A Brody, Yindalon Aphinyanaphongs, Arjun Masurkar, Narges Razavian

Abstract:
Importance: Large language models (LLMs) are crucial for medical tasks. Ensuring their reliability is vital to avoid false results. Our study assesses two state-of-the-art LLMs (ChatGPT and LlaMA-2) for extracting clinical information, focusing on cognitive tests such as MMSE and CDR.
Objective: To evaluate the performance of ChatGPT and LlaMA-2 in extracting MMSE and CDR scores, including their associated dates.
Methods: Our data consisted of 135,307 clinical notes (January 12, 2010 to May 24, 2023) mentioning MMSE, CDR, or MoCA. After applying inclusion criteria, 34,465 notes remained, of which 765 were processed by ChatGPT (GPT-4) and LlaMA-2; 22 experts reviewed the responses. ChatGPT successfully extracted MMSE and CDR instances with dates from 742 notes. We used 20 notes for fine-tuning and training the reviewers. The remaining 722 were assigned to reviewers, 309 of which were each assigned to two reviewers. Inter-rater agreement (Fleiss' Kappa), precision, recall, true/false negative rates, and accuracy were calculated. Our study follows TRIPOD reporting guidelines for model validation.
Results: For MMSE information extraction, ChatGPT (vs. LlaMA-2) achieved accuracy of 83% (vs. 66.4%), sensitivity of 89.7% (vs. 69.9%), a true-negative rate of 96% (vs. 60.0%), and precision of 82.7% (vs. 62.2%). For CDR the results were lower overall, with accuracy of 87.1% (vs. 74.5%), sensitivity of 84.3% (vs. 39.7%), a true-negative rate of 99.8% (vs. 98.4%), and precision of 48.3% (vs. 16.1%). We qualitatively evaluated the MMSE errors of ChatGPT and LlaMA-2 on double-reviewed notes. LlaMA-2 errors included 27 cases of total hallucination, 19 cases of reporting other scores instead of MMSE, 25 missed scores, and 23 cases of reporting only the wrong date. In comparison, ChatGPT's errors included only 3 cases of total hallucination, 17 cases of reporting the wrong test instead of MMSE, and 19 cases of reporting a wrong date.
Conclusions: In this diagnostic/prognostic study of ChatGPT and LlaMA-2 for extracting cognitive exam dates and scores from clinical notes, ChatGPT exhibited high accuracy and outperformed LlaMA-2. The use of LLMs could benefit dementia research and clinical care by identifying patients eligible for treatment initiation or clinical trial enrollment. Rigorous evaluation of LLMs is crucial to understanding their capabilities and limitations.
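
The abstract reports accuracy, sensitivity (recall), true-negative rate, and precision without stating their formulas. A minimal sketch of the standard definitions follows, assuming each reviewed note is labeled as a true/false positive or negative for a given test. The `Confusion` class and the counts in `demo` are hypothetical illustrations, not the study's code or data.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion counts for one model on one extraction task (hypothetical helper)."""
    tp: int  # score present in the note and correctly extracted
    fp: int  # score reported although absent or incorrect
    tn: int  # no score in the note and none reported
    fn: int  # score present in the note but missed

    def accuracy(self) -> float:
        total = self.tp + self.fp + self.tn + self.fn
        return (self.tp + self.tn) / total

    def sensitivity(self) -> float:
        # Also called recall: share of real scores that were found.
        return self.tp / (self.tp + self.fn)

    def true_negative_rate(self) -> float:
        # Share of score-free notes correctly left empty.
        return self.tn / (self.tn + self.fp)

    def precision(self) -> float:
        # Share of reported scores that were correct.
        return self.tp / (self.tp + self.fp)


# Illustrative counts only; these are NOT figures from the study.
demo = Confusion(tp=80, fp=10, tn=90, fn=20)

if __name__ == "__main__":
    print(f"accuracy:    {demo.accuracy():.3f}")
    print(f"sensitivity: {demo.sensitivity():.3f}")
    print(f"TNR:         {demo.true_negative_rate():.3f}")
    print(f"precision:   {demo.precision():.3f}")
```

Under these definitions, a task in which few notes contain a score can show high accuracy and true-negative rate alongside low precision, because a handful of false positives weighs heavily against a small number of true positives.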

Assessing equitable use of large language models for clinical decision support in real-world settings: fine-tuning and internal-external validation using electronic health records from South Asia

Scalable information extraction from free text electronic health records using large language models

Enhancing Early Detection of Cognitive Decline in the Elderly: A Comparative Study Utilizing Large Language Models in Clinical Notes

A toolbox for surfacing health equity harms and biases in large language models

Bias patterns in the application of LLMs for clinical decision support: A comprehensive study

Impact of Large Language Model Assistance on Patients Reading Clinical Notes: A Mixed-Methods Study

Large language models encode clinical knowledge

Generalization in Healthcare AI: Evaluation of a Clinical Large Language Model

Can Large Language Models Provide Emergency Medical Help Where There Is No Ambulance? A Comparative Study on Large Language Model Understanding of Emergency Medical Scenarios in Resource-Constrained Settings

Evaluating large language model workflows in clinical decision support: referral, triage, and diagnosis

Evaluation of General Large Language Models in Contextually Assessing Semantic Concepts Extracted from Adult Critical Care Electronic Health Record Notes

Large language models to identify social determinants of health in electronic health records

Evaluating Large Language Models in Extracting Cognitive Exam Dates and Scores

Large Language Models Leverage External Knowledge to Extend Clinical Insight Beyond Language Boundaries

Model development for bespoke large language models for digital triage assistance in mental health care

Evaluation and mitigation of cognitive biases in medical language models

Red Teaming Large Language Models in Medicine: Real-World Insights on Model Behavior

Systematic Review of Large Language Models for Patient Care: Current Applications and Challenges

Evaluating and Addressing Demographic Disparities in Medical Large Language Models: A Systematic Review