Scalable information extraction from free text electronic health records using large language models

Bowen Gu, Vivian Shao, Ziqian Liao, Valentina Carducci, Santiago Romero-Brufau, Jie Yang, Rishi Desai

DOI: https://doi.org/10.1101/2024.08.08.24311237

2024-08-10

Abstract:

Background: A vast amount of potentially useful information, such as descriptions of patient symptoms, family history, and social history, is recorded as free-text notes in electronic health records (EHRs) but is difficult to extract reliably at scale, which limits its utility in research. This study assesses whether an "out of the box" implementation of open-source large language models (LLMs), without any fine-tuning, can accurately extract social determinants of health (SDoH) data from free-text clinical notes.

Methods: We conducted a cross-sectional study using EHR data from the Mass General Brigham (MGB) system, analyzing free-text notes for SDoH information. We selected a random sample of 200 patients and manually labeled nine SDoH aspects. Eight advanced open-source LLMs were evaluated against a baseline pattern-matching model. Two human reviewers provided the manual labels, achieving 93% inter-annotator agreement. LLM performance was assessed using accuracy metrics for overall, mentioned, and non-mentioned SDoH, and macro F1 scores.

Results: LLMs outperformed the baseline pattern-matching approach, particularly for explicitly mentioned SDoH, achieving up to 40% higher Accuracy_mentioned. openchat_3.5 was the best-performing model, surpassing the baseline in overall accuracy across all nine SDoH aspects. The refined pipeline with prompt engineering reduced hallucinations and improved accuracy.

Conclusions: Open-source LLMs are effective and scalable tools for extracting SDoH from unstructured EHRs, surpassing traditional pattern-matching methods. Further refinement and domain-specific training could enhance their utility in clinical research and predictive analytics, improving healthcare outcomes and addressing health disparities.
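The abstract names its evaluation metrics but does not define them. The following is a minimal sketch of one plausible reading, assuming each (patient, SDoH aspect) pair carries a categorical gold label and that a sentinel label marks aspects the note does not mention. The label values (`"yes"`, `"no"`, `"not mentioned"`) and the function names are illustrative assumptions, not the authors' code.

```python
# Assumed sentinel for SDoH aspects that the note does not mention.
NOT_MENTIONED = "not mentioned"


def _accuracy(pairs):
    """Fraction of (gold, pred) pairs that agree; NaN for an empty subset."""
    if not pairs:
        return float("nan")
    return sum(g == p for g, p in pairs) / len(pairs)


def split_accuracies(gold, pred):
    """Accuracy overall, on mentioned items, and on non-mentioned items.

    'Mentioned' is taken to mean the gold label is anything other than
    NOT_MENTIONED (an assumption about how the paper partitions items).
    """
    pairs = list(zip(gold, pred))
    mentioned = [(g, p) for g, p in pairs if g != NOT_MENTIONED]
    non_mentioned = [(g, p) for g, p in pairs if g == NOT_MENTIONED]
    return {
        "overall": _accuracy(pairs),
        "mentioned": _accuracy(mentioned),
        "non_mentioned": _accuracy(non_mentioned),
    }


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for a single SDoH aspect across five patients (made-up data).
gold = ["yes", "no", "not mentioned", "not mentioned", "yes"]
pred = ["yes", "not mentioned", "not mentioned", "yes", "yes"]
result = split_accuracies(gold, pred)
score = macro_f1(gold, pred)
```

Reporting the mentioned and non-mentioned subsets separately matters because a method that always answers "not mentioned" can score well overall when most aspects are absent from most notes.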

Health Informatics

What problem does this paper attempt to address?

The paper addresses the problem of extracting Social Determinants of Health (SDoH) information from free-text clinical notes in Electronic Health Records (EHRs). Specifically, the researchers test whether open-source large language models (LLMs) can accurately extract SDoH data from clinical notes without any task-specific training. The work aims to overcome the limited scalability and generalizability of traditional methods, such as rule-based approaches or machine learning methods that require extensive manual annotation.

The main objectives of the study are:

1. **Evaluating the capabilities of LLMs**: Assessing whether multiple open-source LLMs can effectively extract SDoH information from free-text EHRs without fine-tuning.
2. **Comparing performance**: Comparing LLMs with a traditional pattern-matching method to test whether LLMs extract SDoH information more accurately.
3. **Improving methods**: Reducing model hallucinations and improving accuracy through prompt engineering and post-processing techniques.

The study found that LLMs outperformed the pattern-matching baseline in extracting explicitly mentioned SDoH information, with the gain varying across SDoH aspects. It also highlighted the importance of prompt engineering and the substantial performance differences among LLMs on information extraction tasks. These findings are relevant to clinical research and predictive analytics, where SDoH data can support efforts to improve healthcare outcomes and address health disparities.
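To make the comparison above concrete, here is a minimal sketch of a keyword pattern-matching baseline, together with a post-processing step that maps free-form LLM output onto a closed label set. The aspect names, regular expressions, and allowed labels are hypothetical; the source does not list the nine SDoH aspects or describe the rules and post-processing the authors used.

```python
import re

NOT_MENTIONED = "not mentioned"  # assumed sentinel label

# Hypothetical keyword patterns for two illustrative aspects.
PATTERNS = {
    "tobacco_use": re.compile(
        r"\b(smok(?:es|er|ing)|tobacco|cigarettes?)\b", re.IGNORECASE
    ),
    "alcohol_use": re.compile(r"\b(alcohol|drinks?|etoh)\b", re.IGNORECASE),
}


def baseline_extract(note):
    """Flag each aspect as 'mentioned' if any of its keywords appears."""
    return {
        aspect: "mentioned" if pattern.search(note) else NOT_MENTIONED
        for aspect, pattern in PATTERNS.items()
    }


# Assumed closed set of answers the LLM is prompted to choose from.
ALLOWED_LABELS = {"yes", "no", NOT_MENTIONED}


def normalize_llm_answer(raw):
    """Map raw model text to an allowed label, else fall back to the sentinel.

    Discarding anything outside the closed set is one simple way to keep
    off-script (possibly hallucinated) answers out of the results.
    """
    cleaned = raw.strip().lower().rstrip(".")
    return cleaned if cleaned in ALLOWED_LABELS else NOT_MENTIONED


note = "Patient smokes one pack per day. Lives alone."
flags = baseline_extract(note)
label_ok = normalize_llm_answer(" Yes. ")
label_bad = normalize_llm_answer("The patient probably owns a dog")
```

A keyword baseline like this one detects only surface mentions: a note reading "denies alcohol use" would still be flagged for `alcohol_use`, which is the kind of context an LLM is expected to handle better.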

