Large language models for accurate disease detection in electronic health records

Nils Burgisser,Etienne Chalot,Samia Mehouachi,Clement P. Buclin,Kim Lauper,Delphine Sophie Courvoisier,Denis Mongin

DOI: https://doi.org/10.1101/2024.07.27.24311106

2024-07-29

Abstract:

Importance: The use of large language models (LLMs) in medicine is increasing, with potential applications in electronic health records (EHR) to create patient cohorts or identify patients who meet clinical trial recruitment criteria. However, significant barriers remain, including the extensive computing resources required, the lack of performance evaluation, and challenges in implementation.

Objective: This study aims to propose and test a framework to detect disease diagnoses using a recent lightweight LLM on French-language EHR documents. Specifically, it focuses on detecting gout ("goutte" in French), a ubiquitous French term that has multiple meanings beyond the disease. The study compares the performance of the LLM-based framework with traditional natural language processing techniques and tests its dependence on the parameters used.

Design: The framework was developed using a training and testing set of 700 paragraphs assessing "gout", drawn from a random selection of retrospective EHR documents. All paragraphs were manually reviewed by two health-care professionals (HCP) and classified as disease (true gout) or non-disease; this classification served as the gold standard. The LLM's accuracy was tested using few-shot and chain-of-thought prompting and compared to a regular expression (regex)-based method, focusing on the effects of model parameters and prompt structure. The framework was further validated on 600 paragraphs assessing calcium pyrophosphate deposition disease (CPPD).

Setting: The documents were sampled from the electronic health records of a tertiary university hospital in Geneva, Switzerland.

Participants: Adults over 18 years of age.

Exposure: Meta's Llama 3 8B LLM or the traditional method, against a gold standard.

Main Outcomes and Measures: Positive and negative predictive value, as well as accuracy of the tested models.

Results: The LLM-based algorithm outperformed the regex method, achieving a 92.7% [88.7-95.4%] positive predictive value, a 96.6% [94.6-97.8%] negative predictive value, and an accuracy of 95.4% [93.6-96.7%] for gout. In the validation set on CPPD, accuracy was 94.1% [90.2-97.6%]. The LLM framework performed well over a wide range of parameter values.

Conclusions and Relevance: LLMs were able to accurately detect disease diagnoses from EHRs, even in non-English languages. They could facilitate creating large disease registries in any language, improving disease care assessment and patient recruitment for clinical trials.
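The outcome measures reported above (positive predictive value, negative predictive value, and accuracy, each with an interval) can all be derived from a confusion matrix against the gold standard. The following is a minimal sketch under stated assumptions: the abstract does not say how the intervals were computed, so the Wilson score interval is used here purely as an illustration, and the counts in the usage example are made up and are not the study's data.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%).

    Assumption: the source does not state its interval method; Wilson is
    one common choice for proportions and is used here for illustration.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


def classification_metrics(tp, fp, tn, fn):
    """Return PPV, NPV and accuracy, each as (estimate, (low, high))."""
    total = tp + fp + tn + fn
    return {
        # PPV: share of paragraphs flagged as disease that truly are disease
        "ppv": (tp / (tp + fp), wilson_interval(tp, tp + fp)),
        # NPV: share of paragraphs flagged as non-disease that truly are not
        "npv": (tn / (tn + fn), wilson_interval(tn, tn + fn)),
        # Accuracy: share of all paragraphs classified correctly
        "accuracy": ((tp + tn) / total, wilson_interval(tp + tn, total)),
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to demonstrate the function.
    for name, (estimate, (low, high)) in classification_metrics(
        tp=90, fp=10, tn=180, fn=20
    ).items():
        print(f"{name}: {estimate:.1%} [{low:.1%}-{high:.1%}]")
```

PPV and NPV are computed over the predicted-positive and predicted-negative paragraphs respectively, so their intervals use different denominators than the accuracy interval.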

Rheumatology

What problem does this paper attempt to address?

This paper evaluates how accurately and efficiently large language models (LLMs) detect diseases in unstructured electronic health record (EHR) text, compared with traditional natural language processing techniques. Specifically, the research aims to:

1. **Propose and test a lightweight LLM-based framework** for detecting disease diagnoses in French EHR documents. It focuses on gout because the French word for it, "goutte", has multiple meanings beyond the disease.
2. **Compare the performance of the LLM framework with traditional natural language processing techniques**, especially in detecting gout and calcium pyrophosphate deposition disease (CPPD).
3. **Evaluate the dependence of the LLM framework on different parameters** to ensure its robustness and applicability.

### Key issues

- **Accuracy**: Can LLMs accurately identify the diagnosis of specific diseases in unstructured texts?
- **Efficiency**: Can LLMs perform this task more efficiently than traditional methods?
- **Robustness**: How does the LLM framework perform under different parameter settings?

### Research background

As large language models are applied more widely in medicine, especially to electronic health records, researchers hope to use them to create patient cohorts or identify patients who meet the recruitment criteria for clinical trials. Obstacles remain, however, including the computing resources required, the lack of performance evaluation, and implementation challenges.

### Research design

- **Data sources**: EHR documents from a tertiary university hospital in Geneva, Switzerland, including 700 paragraphs containing "gout" and 600 paragraphs containing terms related to "CPPD".
- **Gold standard**: Each paragraph was manually reviewed by two health-care professionals and classified as a true disease diagnosis or not.
- **Comparison methods**: Meta's Llama 3 8B LLM was compared with a regular expression (regex) method.

### Main results

- **Accuracy**: The LLM framework achieved an overall accuracy of 95.4% in detecting gout, with a positive predictive value (PPV) of 92.7% and a negative predictive value (NPV) of 96.6%.
- **Robustness**: The LLM framework performed well and remained stable over a wide range of parameter values.
- **Validation**: In CPPD detection, the LLM framework also performed well, with an accuracy of 94.1%.

### Conclusion

LLMs can accurately detect disease diagnoses from EHRs, even in non-English languages. This opens the way to automated EHR-based disease registries and simpler patient recruitment for clinical trials.
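The two approaches compared in the study, a regex method and few-shot prompting of an LLM, can be sketched as follows. This is a hypothetical illustration only: the patterns, the example paragraphs, the prompt wording, and the function names `regex_detects_gout` and `build_prompt` are all assumptions made for this sketch and are not taken from the paper. No model is called; the second function only assembles the prompt text.

```python
import re

# Hypothetical patterns: the paper's actual regex is not given in this text.
# Singular "goutte" as a whole word ("gouttes" does not match).
GOUT_WORD = re.compile(r"\bgoutte\b", re.IGNORECASE)

# Contexts where "goutte" usually means a drop rather than the disease.
NON_DISEASE_CONTEXT = re.compile(
    r"goutte\s*-?\s*à\s*-?\s*goutte"  # "goutte-à-goutte": a drip infusion
    r"|\d+\s+gouttes?\b"              # "2 gouttes": a dose counted in drops
    r"|\bune\s+goutte\s+de\b",        # "une goutte de ...": a drop of something
    re.IGNORECASE,
)


def regex_detects_gout(paragraph):
    """Keyword baseline: True if the paragraph looks like a gout mention."""
    if NON_DISEASE_CONTEXT.search(paragraph):
        return False
    return GOUT_WORD.search(paragraph) is not None


# Made-up labelled examples for the few-shot part of the prompt.
FEW_SHOT_EXAMPLES = [
    ("Crise de goutte du gros orteil traitée par colchicine.", "yes"),
    ("Collyre : 2 gouttes dans chaque oeil.", "no"),
]


def build_prompt(paragraph, examples=FEW_SHOT_EXAMPLES):
    """Assemble a few-shot prompt asking for step-by-step reasoning."""
    lines = [
        "Decide whether the paragraph states that the patient has gout "
        "(the disease).",
        "Reason step by step, then answer 'yes' or 'no' on the last line.",
        "",
    ]
    for text, label in examples:
        lines += [f"Paragraph: {text}", f"Answer: {label}", ""]
    lines += [f"Paragraph: {paragraph}", "Answer:"]
    return "\n".join(lines)
```

With these hypothetical rules, a negated mention such as "Pas d'argument pour une goutte." is still flagged as gout by the keyword baseline, which is the kind of context-dependent case where a prompted model may have an advantage.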

Large language models for extracting histopathologic diagnoses from electronic health records

Critical Care Studies Using Large Language Models Based on Electronic Healthcare Records: A Technical Note

Large language models to identify social determinants of health in electronic health records

LLMD: A Large Language Model for Interpreting Longitudinal Medical Records

Transformative potential of Large Language Models in data mining on Electronic Health Records.

Enhancing Early Detection of Cognitive Decline in the Elderly: A Comparative Study Utilizing Large Language Models in Clinical Notes

Optimizing large language models in digestive disease: strategies and challenges to improve clinical outcomes

Evaluating large language models on medical, lay language, and self-reported descriptions of genetic conditions

Scalable information extraction from free text electronic health records using large language models

Large Language Model-Driven Evaluation of Medical Records Using MedCheckLLM

Large Language Models for Disease Diagnosis: A Scoping Review

Large Language Models with Retrieval-Augmented Generation for Zero-Shot Disease Phenotyping

Generative Large Language Models in Electronic Health Records for Patient Care Since 2023: A Systematic Review

Evaluating Large Language Models for Public Health Classification and Extraction Tasks

Evaluating large language models in medical applications: a survey

Is larger always better? Evaluating and prompting large language models for non-generative medical tasks

Evaluating large language model workflows in clinical decision support: referral, triage, and diagnosis

Potential of Large Language Models in Health Care: Delphi Study

From Text to Tables: A Local Privacy Preserving Large Language Model for Structured Information Retrieval from Medical Documents