Abstract:
Background: Large language models (LLMs) have shown promising performance in various healthcare domains, but their effectiveness in identifying specific clinical conditions in real medical records is less explored. This study evaluates LLMs for detecting signs of cognitive decline in real electronic health record (EHR) clinical notes and compares their error profiles with those of traditional models. The insights gained will inform strategies for performance enhancement.
Methods: This study, conducted at Mass General Brigham in Boston, MA, analyzed clinical notes from the four years prior to a 2019 diagnosis of mild cognitive impairment in patients aged 50 and older. For model development, we used a random annotated sample of 4,949 note sections, filtered with keywords related to cognitive functions. For testing, we used a random annotated sample of 1,996 note sections without keyword filtering. We developed prompts for two LLMs, Llama 2 and GPT-4, on HIPAA-compliant cloud-computing platforms using multiple approaches (e.g., hard and soft prompting and error analysis-based instructions) to select the optimal LLM-based method. Baseline models included a hierarchical attention-based neural network and XGBoost. Subsequently, we constructed an ensemble of the three models (the selected LLM-based method and the two baselines) using a majority vote approach.
Results: GPT-4 demonstrated superior accuracy and efficiency compared to Llama 2, but did not outperform traditional models. The ensemble model outperformed the individual models, achieving a precision of 90.3%, a recall of 94.2%, and an F1-score of 92.2%. Notably, the ensemble model substantially improved precision, which rose from a range of 70%-79% to above 90% relative to the best-performing single model. Error analysis revealed that 63 samples were incorrectly predicted by at least one model; however, only 2 cases (3.2%) were errors shared by all models, indicating diverse error profiles among them.
Conclusions: LLMs and traditional machine learning models trained using local EHR data exhibited diverse error profiles. Combined in an ensemble, these models proved complementary, enhancing diagnostic performance. Future research should investigate integrating LLMs with smaller, localized models and incorporating medical data and domain knowledge to enhance performance on specific tasks.
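
The majority vote and the reported metrics can be illustrated with a minimal Python sketch. It is not the study's implementation: the function names (`majority_vote`, `precision_recall_f1`, `f1_score`) and the toy labels are hypothetical, and the only values taken from the abstract are the reported precision (90.3%) and recall (94.2%), which are used to check the reported F1-score.

```python
from typing import List, Sequence, Tuple


def majority_vote(*model_preds: Sequence[int]) -> List[int]:
    """Combine binary predictions (0/1) from an odd number of models."""
    n_models = len(model_preds)
    if n_models % 2 == 0:
        raise ValueError("use an odd number of models to avoid ties")
    # A sample is labeled positive when more than half of the models vote 1.
    return [1 if 2 * sum(votes) > n_models else 0 for votes in zip(*model_preds)]


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Toy labels (hypothetical, not study data): each model errs on different samples.
y_true = [1, 1, 0, 0, 1, 0]
llm_preds = [1, 0, 1, 0, 1, 0]  # stand-in for the selected LLM-based method
han_preds = [1, 1, 0, 1, 1, 0]  # stand-in for the hierarchical attention network
xgb_preds = [0, 1, 0, 0, 1, 1]  # stand-in for XGBoost

ensemble_preds = majority_vote(llm_preds, han_preds, xgb_preds)
ensemble_metrics = precision_recall_f1(y_true, ensemble_preds)

# Consistency check of the reported F1 from the reported precision and recall.
reported_f1 = f1_score(0.903, 0.942)
```

In the toy data every model makes two errors, but no error is shared by a majority, so the vote recovers the true labels. This is the complementarity the abstract describes, shown on invented data rather than on the study's notes.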