Filling the gaps: leveraging large language models for temporal harmonization of clinical text across multiple medical visits for clinical prediction

Inyoung Choi, Qi Long, Emily Getzen
DOI: https://doi.org/10.1101/2024.05.06.24306959
2024-05-07
Abstract: Electronic health records offer great promise for early disease detection, treatment evaluation, information discovery, and other important facets of precision health. Clinical notes, in particular, may contain nuanced information about a patient’s condition, treatment plans, and history that structured data may not capture. As a result, and with advancements in natural language processing, clinical notes have been increasingly used in supervised prediction models. To predict long-term outcomes such as chronic disease and mortality, it is often advantageous to leverage data occurring at multiple time points in a patient’s history. However, these data are often collected at irregular time intervals and varying frequencies, thus posing an analytical challenge. Here, we propose the use of large language models (LLMs) for robust temporal harmonization of clinical notes across multiple visits. We compare multiple state-of-the-art LLMs in their ability to generate useful information during time gaps, and evaluate performance in supervised deep learning models for clinical prediction.
Intensive Care and Critical Care Medicine
What problem does this paper attempt to address?
This paper examines how large language models (LLMs) can address the temporal irregularity of clinical notes in electronic health records (EHRs) and thereby improve clinical prediction. Because patients visit at inconsistent intervals and with varying frequency, EHR data are difficult to analyze longitudinally, and machine learning models trained on them may predict long-term outcomes inaccurately. To tackle this, the authors propose using LLMs to generate useful information for the gaps between visits, giving each patient's clinical notes a regular temporal structure.
The paper first reviews traditional approaches for handling irregular time series data, such as zero filling, last observation carried forward (LOCF), and multimodal imputation. It then proposes using LLMs, particularly those trained on biological and clinical data, to generate the missing clinical note text that fills the time gaps. The harmonized notes are fed into a supervised deep learning model that predicts one-year mortality for intensive care unit and emergency department patients, and the results are compared with those of the existing methods.
The experiments show that GPT-4 achieves the best AUC and F1 scores in both zero-shot and one-shot settings compared with the other methods, including multimodal imputation and LOCF. For patients with a large amount of missing data in particular, filling the gaps with GPT-4 significantly improves model performance. The study also finds that GPT-4 can improve algorithmic fairness across patient populations with differing data completeness, because it enriches the records of patients whose data are incomplete.
The paper concludes by discussing potential limitations of LLMs, such as limited interpretability and possible "hallucinated" output, and suggests that future research focus on improving the interpretability of LLMs in the medical field and on reducing erroneous generated content.
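To make the gap-filling strategies described above concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each patient's notes have already been assigned to equally spaced time bins; the function name `harmonize`, the bin layout, and the `generate` callback (a stand-in for a call to an LLM such as GPT-4) are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Optional


def harmonize(
    notes_by_bin: Dict[int, str],
    n_bins: int,
    strategy: str = "locf",
    generate: Optional[Callable[[List[str], int], str]] = None,
) -> List[str]:
    """Return one note per time bin, filling bins that have no observed note.

    strategy:
      "zero"     - leave missing bins empty (the text analogue of zero filling)
      "locf"     - repeat the most recent earlier note (last observation carried forward)
      "generate" - ask a caller-supplied function for a note (hypothetical LLM hook)
    """
    filled: List[str] = []
    for t in range(n_bins):
        if t in notes_by_bin:
            # An observed note is always kept unchanged.
            filled.append(notes_by_bin[t])
        elif strategy == "zero":
            filled.append("")
        elif strategy == "locf":
            # Nothing to carry forward before the first observed note.
            filled.append(filled[-1] if filled else "")
        elif strategy == "generate":
            if generate is None:
                raise ValueError("strategy 'generate' requires a generate callback")
            # The callback sees the history so far and the index of the gap.
            filled.append(generate(list(filled), t))
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return filled


# Toy example: notes observed in bins 0 and 3 out of 5 (illustrative text only).
notes = {0: "admit: chest pain", 3: "follow-up: stable"}


def stub_llm(history: List[str], t: int) -> str:
    # Placeholder for a real model call; returns a marker instead of clinical text.
    return f"[generated note for bin {t}]"


for name in ("zero", "locf", "generate"):
    print(name, harmonize(notes, 5, name, generate=stub_llm))
```

In this sketch every strategy produces a sequence of the same length, which is what lets a downstream sequence model consume patients with few visits and patients with many visits in the same way. How the paper actually bins visits and prompts the LLMs is not specified in this summary, so those details are left to the callback.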