Abstract: Social Determinants of Health (SDoH) are an important part of the exposome and are known to have a large impact on variation in health outcomes. In particular, housing stability is known to be intricately linked to a patient’s health status, and pregnant women experiencing housing instability (HI) are known to have worse health outcomes. Most SDoH information is stored in electronic health records (EHRs) as free text (unstructured) clinical notes, which have traditionally required natural language processing (NLP) for automatic identification of relevant text or keywords. A patient’s housing status can be ambiguous or subjective, and can change from note to note or within the same note, making it difficult to use existing NLP solutions. New developments in NLP allow researchers to prompt large language models (LLMs) to perform complex, subjective annotation tasks requiring reasoning that previously could only be attempted by human annotators. For example, LLMs such as GPT (Generative Pre-trained Transformer) enable researchers to analyze complex, unstructured data using simple prompts. We used a secure platform within a large healthcare system to compare the ability of GPT-3.5 and GPT-4 to identify instances of both current and past housing instability, as well as general housing status, from 25,217 notes from 795 pregnant women. Results from these LLMs were compared with results from manual annotation, a named entity recognition (NER) model, and regular expressions (RegEx). We developed a chain-of-thought prompt requiring evidence and justification for each note from the LLMs, to help maximize the chances of finding relevant text related to HI while minimizing hallucinations and false positives.
Compared with GPT-3.5 and the NER model, GPT-4 had the highest performance and had a much higher recall (0.924) than human annotators (0.702) in identifying patients experiencing current or past housing instability, although its precision (0.850) was lower than that of human annotators (0.971). In most cases, the evidence output by GPT-4 was similar or identical to that of human annotators, and there was no evidence of hallucinations in any of the outputs from GPT-4. Most cases where the annotators and GPT-4 differed were ambiguous or subjective, such as “living in an apartment with too many people”. We also looked at GPT-4 performance on de-identified versions of the same notes and found that precision improved slightly (0.936 original, 0.939 de-identified), while recall dropped (0.781 original, 0.704 de-identified). This work demonstrates that, while manual annotation is likely to yield slightly more accurate results overall, LLMs provide a scalable, cost-effective solution with the advantage of greater recall. At the same time, further evaluation is needed to address the risk of missed cases and bias in the initial selection of housing-related notes. Additionally, while it was possible to reduce confabulation, signs of unusual justifications remained. Given these factors, together with changes in both LLMs and charting over time, this approach is not yet appropriate for use as a fully automated process. However, these results demonstrate the potential for using LLMs for computer-assisted annotation with human review, reducing cost and increasing recall. More efficient methods for obtaining structured SDoH data can help accelerate inclusion of exposome variables in biomedical research, and support healthcare systems in identifying patients who could benefit from proactive outreach.
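
As an illustration of the patient-level metrics reported above, the following is a minimal sketch of how precision and recall can be computed from sets of flagged patients. The function name `precision_recall` and the patient identifiers are hypothetical and made up for this example; they are not taken from the study, and the study's actual evaluation procedure may differ.

```python
from typing import Set, Tuple


def precision_recall(predicted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) for a set of flagged patients.

    predicted: patient IDs flagged as experiencing housing instability
    gold:      patient IDs that truly experienced housing instability
    """
    true_positives = len(predicted & gold)
    # Precision: share of flagged patients that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of true cases that were flagged.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data only (hypothetical IDs, not study data).
gold = {"p1", "p2", "p3", "p4"}
llm_flags = {"p1", "p2", "p3", "p5"}   # finds more cases, one false positive
human_flags = {"p1", "p2"}             # no false positives, misses two cases

llm_precision, llm_recall = precision_recall(llm_flags, gold)
human_precision, human_recall = precision_recall(human_flags, gold)
print(f"LLM:   precision={llm_precision:.2f} recall={llm_recall:.2f}")
print(f"Human: precision={human_precision:.2f} recall={human_recall:.2f}")
```

In this toy case the annotator with the broader net has higher recall and lower precision, which mirrors the direction of the trade-off described in the abstract, though the numbers here are purely illustrative.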

Harnessing generative AI to annotate the severity of all phenotypic abnormalities within the Human Phenotype Ontology

Improving Automated Deep Phenotyping Through Large Language Models Using Retrieval Augmented Generation

Assessing the utility of large language models for phenotype-driven gene prioritization in the diagnosis of rare genetic disease

Assessing the Utility of Large Language Models for Phenotype-Driven Gene Prioritization in Rare Genetic Disorder Diagnosis

High-Throughput Phenotyping of Clinical Text Using Large Language Models

A Large Language Model Outperforms Other Computational Approaches to the High-Throughput Phenotyping of Physician Notes

Automating Clinical Phenotyping Using Natural Language Processing: An Application for Crohn's Disease

An evaluation of GPT models for phenotype concept recognition

Leveraging Generative AI to Accelerate Biocuration of Medical Actions for Rare Disease

Identifying and Extracting Rare Diseases and Their Phenotypes with Large Language Models

On the limitations of large language models in clinical diagnosis

PhenoID, a language model normalizer of physical examinations from genetics clinical notes

Using Large Language Models to Annotate Complex Cases of Social Determinants of Health in Longitudinal Clinical Records

Large language models facilitate the generation of electronic health record phenotyping algorithms

Assessing DxGPT: Diagnosing Rare Diseases with Various Large Language Models

Identifying and Extracting Rare Disease Phenotypes with Large Language Models

Diagnostic Accuracy of a Custom Large Language Model on Rare Pediatric Disease Case Reports

Large Language Models with Retrieval-Augmented Generation for Zero-Shot Disease Phenotyping

Enhancing Phenotype Recognition in Clinical Notes Using Large Language Models: PhenoBCBERT and PhenoGPT