Abstract:The consistent and persuasive evidence illustrating the influence of social determinants on health has prompted a growing realization throughout the health care sector that enhancing health and health equity will likely depend, at least to some extent, on addressing detrimental social determinants. However, detailed social determinants of health (SDoH) information is often buried within clinical narrative text in electronic health records (EHRs), necessitating natural language processing (NLP) methods to automatically extract these details. Most current NLP efforts for SDoH extraction have been limited, investigating on limited types of SDoH elements, deriving data from a single institution, focusing on specific patient cohorts or note types, with reduced focus on generalizability. This study aims to address these issues by creating cross-institutional corpora spanning different note types and healthcare systems, and developing and evaluating the generalizability of classification models, including novel large language models (LLMs), for detecting SDoH factors from diverse types of notes from four institutions: Harris County Psychiatric Center, University of Texas Physician Practice, Beth Israel Deaconess Medical Center, and Mayo Clinic. Four corpora of deidentified clinical notes were annotated with 21 SDoH factors at two levels: level 1 with SDoH factor types only and level 2 with SDoH factors along with associated values. Three traditional classification algorithms (XGBoost, TextCNN, Sentence BERT) and an instruction tuned LLM-based approach (LLaMA) were developed to identify multiple SDoH factors. Substantial variation was noted in SDoH documentation practices and label distributions based on patient cohorts, note types, and hospitals. The LLM achieved top performance with micro-averaged F1 scores over 0.9 on level 1 annotated corpora and an F1 over 0.84 on level 2 annotated corpora. While models performed well when trained and tested on individual datasets, cross-dataset generalization highlighted remaining obstacles. To foster collaboration, access to partial annotated corpora and models trained by merging all annotated datasets will be made available on the PhysioNet repository.

Towards Holistic Disease Risk Prediction using Small Language Models

Clinical Risk Prediction Using Language Models: Benefits And Considerations

Large Language Multimodal Models for 5-Year Chronic Disease Cohort Prediction Using EHR Data

Health system-scale language models are all-purpose prediction engines

Uncertainty Quantification for Clinical Outcome Predictions with (Large) Language Models

Large Language Models for Social Determinants of Health Information Extraction from Clinical Notes - A Generalizable Approach across Institutions

Poem: lichenplanus.

Multimodal Learning for Cardiovascular Risk Prediction using EHR Data

Filling the gaps: leveraging large language models for temporal harmonization of clinical text across multiple medical visits for clinical prediction

Health-LLM: Large Language Models for Health Prediction via Wearable Sensor Data

Efficient and Personalized Mobile Health Event Prediction via Small Language Models

Prompting Large Language Models for Zero-Shot Clinical Prediction with Structured Longitudinal Electronic Health Record Data

EHR-Based Mobile and Web Platform for Chronic Disease Risk Prediction Using Large Language Multimodal Models

Multimodal Large Language Models in Health Care: Applications, Challenges, and Future Outlook

Redefining Digital Health Interfaces with Large Language Models

LLMs-based Few-Shot Disease Predictions using EHR: A Novel Approach Combining Predictive Agent Reasoning and Critical Agent Instruction

Health-LLM: Personalized Retrieval-Augmented Disease Prediction Model

Longitudinal modeling of multimorbidity trajectories using large language models

Large Language Models for Integrating Social Determinant of Health Data: A Case Study on Heart Failure 30-Day Readmission Prediction

Large Language Models with Retrieval-Augmented Generation for Zero-Shot Disease Phenotyping

A Hybrid Framework with Large Language Models for Rare Disease Phenotyping