Knowledge Enhanced Conditional Imputation for Healthcare Time-series

Linglong Qian, Joseph Arul Raj, Hugh Logan Ellis, Ao Zhang, Yuezhou Zhang, Tao Wang, Richard JB Dobson, Zina Ibrahim
2024-09-30
Abstract: We present an end-to-end architecture for managing complex missingness in multivariate time series derived from hospital electronic health records (EHRs). Our Conditional Self-Attention Imputation (CSAI) is a recurrent neural network architecture equipped with a number of techniques aiming to improve imputation accuracy by aligning the model with the subtle temporal and spatial dependencies typical of clinical data. CSAI a) utilises an attention-based hidden state initialisation to capture long- and short-range correlations within the time-series, b) incorporates a knowledge embedding technique to capture clinical data recording patterns and c) employs a non-uniform masking strategy to adapt its weights to data temporal and cross-sectional missingness patterns. Extensive evaluation of three EHR benchmark data sets demonstrates that CSAI enhances the current state of the art efficacy in data restoration in addition to performance on downstream tasks. Furthermore, CSAI is integrated within the PyPOTS Python library for benchmarking, offering open and standardised benchmarking capabilities and ease of use for researchers.
Machine Learning, Artificial Intelligence
### What problem does this paper attempt to address?

This paper addresses complex missingness in multivariate time-series data derived from electronic health records (EHRs). Specifically, it proposes a new architecture, **Conditional Self-Attention Imputation (CSAI)**, to overcome the shortcomings of existing methods in handling medical time-series data.

#### Main problems

1. **Complex missing patterns**:
   - EHR data contain a large number of missing values, and these values are not missing at random. The missingness of each feature is shaped by clinical and administrative decisions, which produces complex and diverse missing patterns.
   - For example, heart rate is usually monitored frequently, while white blood cell count is measured only in specific situations, such as when an infection is suspected.
2. **Spatio-temporal dependencies**:
   - Medical time-series data have complex spatio-temporal dependencies, that is, correlations between different features as well as dependencies over time. Existing models struggle to capture short-term and long-term temporal dependencies at the same time.
   - For example, short-term blood glucose fluctuations and long-term HbA1c levels in diabetic patients need to be modelled together, and cardiovascular risk assessment needs to jointly consider biomarkers such as cholesterol, blood glucose and blood pressure.
3. **The importance of domain knowledge**:
   - Existing models do not fully account for the influence of domain knowledge on feature recording patterns. For example, the association between hypertension and nephropathy means that the recording patterns of blood pressure and urine creatinine levels are highly correlated.
   - This domain knowledge is crucial for improving imputation accuracy.
4. **Limitations of random masking strategies**:
   - Existing models usually mask observed values at random to generate ground-truth data for evaluation. This oversimplifies the spatio-temporal dependencies found in real EHR time series and leads to inaccurate evaluation results.

### CSAI's solutions

To address these challenges, CSAI proposes the following improvements:

1. **Conditional hidden state initialization**:
   - Use a self-attention mechanism to initialize the hidden state so that it better captures short- and long-term temporal dependencies.
   - The formula is:
     \[ h_{\text{init}} = \text{Conv1D}_2(H_1 W_2 + b_2) \]
     where \(H_1 = \text{Conv1D}_1(C_{\text{out}} W_1 + b_1)\) and \(C_{\text{out}}\) is the output of the multi-head self-attention mechanism.
2. **Domain-informed time-decay function**:
   - Introduce a time-decay function based on domain knowledge that adjusts the association weights between missing values and past observations according to each feature's recording pattern.
   - The formula is:
     \[ A_t = \exp\bigl(-\max(0,\, W_\gamma(\delta_t - \tau) + b_\gamma)\bigr) \]
     where \(\delta_t\) is the time elapsed since the feature was last observed and \(\tau\) is the median time interval between two recordings of that feature.
3. **Non-uniform masking strategy**:
   - Design a non-uniform masking strategy that simulates the structured missing patterns inherent in EHR data and avoids the oversimplification caused by random masking.
   - The non-uniform masking probability \(P_{\text{nu}}(d)\) is calculated as:
     \[ P_{\text{nu}}(d) = R_{\text{factor}}(d \mid U, I) \times P_{\text{dist}}(d) \]

Through these improvements, CSAI has demonstrated superior imputation performance on multiple benchmark datasets, especially when dealing with data with a low missing rate.
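The domain-informed time-decay function above can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes a diagonal (per-feature) \(W_\gamma\) so the formula can be applied element-wise, and the helper names `median_gap` and `time_decay`, the recording times, and the parameter values are all hypothetical choices made for illustration.

```python
import numpy as np

def median_gap(record_times):
    """tau: median interval between consecutive recordings of one feature."""
    return float(np.median(np.diff(record_times)))

def time_decay(delta, tau, w_gamma, b_gamma):
    """A_t = exp(-max(0, w_gamma * (delta_t - tau) + b_gamma)).

    Assumption: w_gamma is a per-feature (diagonal) weight, so the
    formula is applied element-wise over a (time, feature) array.
    """
    return np.exp(-np.maximum(0.0, w_gamma * (delta - tau) + b_gamma))

# Hypothetical recording times in hours (illustrative, not real data):
# heart rate is charted hourly, white blood cell count once a day.
heart_rate_times = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
wbc_times = np.array([0.0, 24.0, 48.0])
tau = np.array([median_gap(heart_rate_times), median_gap(wbc_times)])

# Hypothetical parameters; in a trained model these would be learned.
w_gamma = np.array([0.5, 0.5])
b_gamma = np.zeros(2)

# delta[t, d]: hours since feature d was last observed at step t.
delta = np.array([
    [0.5, 12.0],   # both gaps shorter than tau
    [6.0, 12.0],   # heart-rate gap exceeds its tau, WBC gap does not
    [6.0, 36.0],   # both gaps exceed tau
])

A = time_decay(delta, tau, w_gamma, b_gamma)
print(A)
```

With these values (and a zero bias), any gap shorter than the feature's median interval keeps a weight of 1, so a 12-hour gap leaves the white blood cell count untouched while a 6-hour gap already shrinks the heart-rate weight to exp(-2.5), roughly 0.08. This is the sense in which the decay is tied to each feature's recording pattern.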