An Integrated Resampling Methods for Imbalanced Sporadic Temporal Data in EHRs

Qi Ye, Tomohiro Kuroda, Tong Ruan, Wenlong Zhang, Xiaoling Ge
DOI: https://doi.org/10.1109/bibm52615.2021.9669865
2021-01-01
Abstract: Most real-world applications in EHRs involve temporal data with skewed class distributions. Imbalanced classification becomes more difficult with sporadic temporal data, where variables are correlated and some values are missing. A common solution for classification tasks with imbalanced data is oversampling, which generates new samples to rebalance the classes. However, traditional oversampling methods usually change the data distribution, thereby introducing bias. This paper proposes a self-adaptive integrated oversampling method for imbalanced sporadic temporal data in EHRs. Masking vectors and density vectors are introduced to measure the distribution of missing values in each sample, and the minority samples are divided into high-density samples and sparse-density samples. We extend the resampling strategies by combining a subsample alignment method with a structure-preserving oversampling method. A sample-difference weight is used to improve classification performance. Furthermore, a filter mechanism is proposed to remove noisy samples efficiently. The experimental results show that the proposed method improves performance over traditional resampling methods in terms of the AUC, F1, and G-mean evaluation metrics.
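The abstract does not define the masking and density vectors, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes a masking vector is a binary indicator of observed values, a density vector is the per-variable fraction of observed time steps, and a sample counts as high-density when its mean density reaches a threshold. The function names, the `threshold` parameter, and the toy data are all hypothetical.

```python
import numpy as np

def masking_vectors(X):
    """Binary mask: 1 where a value is observed, 0 where it is missing (NaN)."""
    return (~np.isnan(X)).astype(int)

def density_vectors(mask):
    """Assumed definition: per-variable fraction of observed time steps.

    mask has shape (n_samples, n_timesteps, n_variables);
    the result has shape (n_samples, n_variables).
    """
    return mask.mean(axis=1)

def split_by_density(X_minority, threshold=0.5):
    """Split minority samples into high-density and sparse groups.

    The threshold is a hypothetical parameter, not a value from the paper.
    """
    mask = masking_vectors(X_minority)
    score = density_vectors(mask).mean(axis=1)  # one density score per sample
    dense_idx = np.where(score >= threshold)[0]
    sparse_idx = np.where(score < threshold)[0]
    return dense_idx, sparse_idx, score

# Toy minority set: 3 samples, 2 time steps, 2 variables (made-up values).
X = np.array([
    [[1.0, 2.0], [3.0, np.nan]],        # 3 of 4 entries observed
    [[np.nan, np.nan], [0.5, np.nan]],  # 1 of 4 entries observed
    [[1.0, 1.0], [1.0, 1.0]],           # fully observed
])
dense_idx, sparse_idx, score = split_by_density(X)
```

In a pipeline of the kind the abstract describes, the two index sets would then be routed to different resampling strategies; how the paper assigns them is not stated in the abstract.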