Abstract:
Background: Large language models (LLMs) have shown promising performance in various healthcare domains, but their effectiveness in identifying specific clinical conditions in real medical records remains underexplored. This study evaluates LLMs for detecting signs of cognitive decline in real electronic health record (EHR) clinical notes and compares their error profiles with those of traditional models. The resulting insights are intended to inform strategies for improving performance.
Methods: This study, conducted at Mass General Brigham in Boston, MA, analyzed clinical notes from the four years prior to a 2019 diagnosis of mild cognitive impairment in patients aged 50 and older. For model development, we used a random annotated sample of 4,949 note sections, filtered with keywords related to cognitive functions. For testing, we used a random annotated sample of 1,996 note sections without keyword filtering. We developed prompts for two LLMs, Llama 2 and GPT-4, on HIPAA-compliant cloud-computing platforms using multiple approaches (e.g., both hard and soft prompting and error analysis-based instructions) to select the optimal LLM-based method. Baseline models included a hierarchical attention-based neural network and XGBoost. Subsequently, we constructed an ensemble of the three models using a majority vote approach.
Results: GPT-4 demonstrated superior accuracy and efficiency compared to Llama 2, but did not outperform traditional models. The ensemble model outperformed the individual models, achieving a precision of 90.3%, a recall of 94.2%, and an F1-score of 92.2%. Notably, compared with the best-performing single model, the ensemble showed a significant improvement in precision, increasing from a range of 70%-79% to above 90%. Error analysis revealed that 63 samples were incorrectly predicted by at least one model; however, only 2 cases (3.2%) were errors shared by all models, indicating diverse error profiles among them.
Conclusions: LLMs and traditional machine learning models trained on local EHR data exhibited diverse error profiles. The models proved complementary: combining them in an ensemble enhanced diagnostic performance. Future research should investigate integrating LLMs with smaller, localized models and incorporating medical data and domain knowledge to enhance performance on specific tasks.
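
To make the majority-vote idea concrete, the following is a minimal sketch and not the study's implementation. The function names (`majority_vote`, `precision_recall_f1`, `f1_from`) and the toy labels are hypothetical, chosen only for illustration. It shows how three binary classifiers that err on different samples can be combined so that each isolated error is outvoted, and it checks that the reported F1-score follows from the reported precision and recall.

```python
def majority_vote(*model_predictions):
    """Per-sample majority vote over binary (0/1) predictions from several models."""
    return [int(sum(votes) * 2 > len(votes)) for votes in zip(*model_predictions)]


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy labels, invented for illustration only (not data from the study).
# Each hypothetical model makes two errors, but never on the same sample as another.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
llm    = [1, 1, 0, 1, 0, 0, 1, 0]
attn   = [1, 0, 1, 0, 1, 0, 1, 0]
xgb    = [0, 1, 1, 0, 0, 1, 1, 0]

ensemble = majority_vote(llm, attn, xgb)

for name, preds in [("llm", llm), ("attn", attn), ("xgb", xgb), ("ensemble", ensemble)]:
    p, r, f = precision_recall_f1(y_true, preds)
    print(f"{name:9s} precision={p:.2f} recall={r:.2f} f1={f:.2f}")

# Consistency checks on figures reported in the abstract.
print(round(f1_from(0.903, 0.942), 3))   # F1 implied by the reported precision and recall
print(round(2 / 63 * 100, 1))            # share of erroneous samples missed by all models
```

In this toy setup no two models are wrong on the same sample, so the vote corrects every individual error. Real models overlap in their errors, which is why the abstract's count of errors shared by all models (2 of 63) is the relevant indicator of how much an ensemble can help.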

WangLab at MEDIQA-CORR 2024: Optimized LLM-based Programs for Medical Error Detection and Correction

PromptMind Team at MEDIQA-CORR 2024: Improving Clinical Text Correction with Error Categorization and LLM Ensembles

Edinburgh Clinical NLP at MEDIQA-CORR 2024: Guiding Large Language Models with Hints

Maven at MEDIQA-CORR 2024: Leveraging RAG and Medical LLM for Error Detection and Correction in Medical Notes

A Framework to Assess Clinical Safety and Hallucination Rates of LLMs for Medical Text Summarisation

Improving Clinical Expertise in Large Language Models Using Electronic Medical Records

Chain-of-Though (CoT) prompting strategies for medical error detection and correction

MediFact at MEDIQA-CORR 2024: Why AI Needs a Human Touch

IryoNLP at MEDIQA-CORR 2024: Tackling the Medical Error Detection & Correction Task On the Shoulders of Medical Agents

HSE NLP Team at MEDIQA-CORR 2024 Task: In-Prompt Ensemble with Entities and Knowledge Graph for Medical Error Correction

WangLab at MEDIQA-Chat 2023: Clinical Note Generation from Doctor-Patient Conversations using Large Language Models

WangLab at MEDIQA-M3G 2024: Multimodal Medical Answer Generation using Large Language Models

Improving postsurgical fall detection for older Americans using LLM-driven analysis of clinical narratives

Agentic LLM Workflows for Generating Patient-Friendly Medical Reports

The Sound of Healthcare: Improving Medical Transcription ASR Accuracy with Large Language Models

Integrating Knowledge Retrieval and Large Language Models for Clinical Report Correction

Evaluating LLMs at Detecting Errors in LLM Responses

Enhancing Early Detection of Cognitive Decline in the Elderly: A Comparative Study Utilizing Large Language Models in Clinical Notes

Enhancing Small Medical Learners with Privacy-preserving Contextual Prompting

The Need for Guardrails with Large Language Models in Medical Safety-Critical Settings: An Artificial Intelligence Application in the Pharmacovigilance Ecosystem

UMASS_BioNLP at MEDIQA-Chat 2023: Can LLMs generate high-quality synthetic note-oriented doctor-patient conversations?