Abstract:Social Determinants of Health (SDoH) are an important part of the exposome and are known to have a large impact on variation in health outcomes. In particular, housing stability is known to be intricately linked to a patient’s health status, and pregnant women experiencing housing instability (HI) are known to have worse health outcomes. Most SDoH information is stored in electronic health records (EHRs) as free text (unstructured) clinical notes, which traditionally required natural language processing (NLP) for automatic identification of relevant text or keywords. A patient’s housing status can be ambiguous or subjective, and can change from note to note or within the same note, making it difficult to use existing NLP solutions. New developments in NLP allow researchers to prompt LLMs to perform complex, subjective annotation tasks that require reasoning that previously could only be attempted by human annotators. For example, large language models (LLMs) such as GPT (Generative Pre-trained Transformer) enable researchers to analyze complex, unstructured data using simple prompts. We used a secure platform within a large healthcare system to compare the ability of GPT-3.5 and GPT-4 to identify instances of both current and past housing instability, as well as general housing status, from 25,217 notes from 795 pregnant women. Results from these LLMs were compared with results from manual annotation, a named entity recognition (NER) model, and regular expressions (RegEx). We developed a chain-of-thought prompt requiring evidence and justification for each note from the LLMs, to help maximize the chances of finding relevant text related to HI while minimizing hallucinations and false positives. Compared with GPT-3.5 and the NER model, GPT-4 had the highest performance and had a much higher recall (0.924) than human annotators (0.702) in identifying patients experiencing current or past housing instability, although precision was lower (0.850) compared with human annotators (0.971). In most cases, the evidence output by GPT-4 was similar or identical to that of human annotators, and there was no evidence of hallucinations in any of the outputs from GPT-4. Most cases where the annotators and GPT-4 differed were ambiguous or subjective, such as “living in an apartment with too many people”. We also looked at GPT-4 performance on de-identified versions of the same notes and found that precision improved slightly (0.936 original, 0.939 de-identified), while recall dropped (0.781 original, 0.704 de-identified). This work demonstrates that, while manual annotation is likely to yield slightly more accurate results overall, LLMs, when compared with manual annotation, provide a scalable, cost-effective solution with the advantage of greater recall. At the same time, further evaluation is needed to address the risk of missed cases and bias in the initial selection of housing-related notes. Additionally, while it was possible to reduce confabulation, signs of unusual justifications remained. Given these factors, together with changes in both LLMs and charting over time, this approach is not yet appropriate for use as a fully-automated process. However, these results demonstrate the potential for using LLMs for computer-assisted annotation with human review, reducing cost and increasing recall. More efficient methods for obtaining structured SDoH data can help accelerate inclusion of exposome variables in biomedical research, and support healthcare systems in identifying patients who could benefit from proactive outreach.

Accuracy of a Proprietary Large Language Model in Labeling Obstetric Incident Reports

Using Large Language Models to Annotate Complex Cases of Social Determinants of Health in Longitudinal Clinical Records

Zero-shot Interpretable Phenotyping of Postpartum Hemorrhage Using Large Language Models

Large Language Models for Patient Comments Multi-Label Classification

Efficacy of large language models and their potential in Obstetrics and Gynecology education

Comparison of Natural Language Processing of Clinical Notes With a Validated Risk-Stratification Tool to Predict Severe Maternal Morbidity

Evaluating the use of large language models to provide clinical recommendations in the Emergency Department

Evaluating the accuracy of a state-of-the-art large language model for prediction of admissions from the emergency room

The accuracy of large language models in labelling neurosurgical ‘case-control studies’ and risk of bias assessment: protocol for a study of interrater agreement with human reviewers.

Identifying incarceration status in the electronic health record using large language models in emergency department settings

Leveraging Large Language Models in Gynecologic Oncology: A Systematic Review of Current Applications and Challenges

A Machine Learning Approach with Human-AI Collaboration for Automated Classification of Patient Safety Event Reports: Algorithm Development and Validation Study

Evaluating Large Language Models for Public Health Classification and Extraction Tasks

Accuracy, Consistency, and Hallucination of Large Language Models When Analyzing Unstructured Clinical Notes in Electronic Medical Records.

Large Language Model Augmented Clinical Trial Screening

The Foundational Capabilities of Large Language Models in Predicting Postoperative Risks Using Clinical Notes

Enhancing radiology training with GPT-4: Pilot analysis of automated feedback in trainee preliminary reports

A Survey of Clinicians’ Views of the Utility of Large Language Models

The potential and pitfalls of using a large language model such as ChatGPT, GPT-4, or LLaMA as a clinical assistant.

Integrating human expertise & automated methods for a dynamic and multi-parametric evaluation of large language models' feasibility in clinical decision-making