Supervised Pretraining through Contrastive Categorical Positive Samplings to Improve COVID-19 Mortality Prediction

Tingyi Wanyan, Mingquan Lin, Eyal Klang, Kartikeya M Menon, Faris F Gulamali, Ariful Azad, Yiye Zhang, Ying Ding, Zhangyang Wang, Fei Wang, Benjamin Glicksberg, Yifan Peng
DOI: https://doi.org/10.1145/3535508.3545541
Abstract: Clinical EHR data is naturally heterogeneous and contains abundant sub-phenotypes. Such diversity creates challenges for outcome prediction with machine learning models because it leads to high intra-class variance. To address this issue, we propose a supervised pre-training model with a unique embedded k-nearest-neighbor positive sampling strategy. We demonstrate the benefit of this framework theoretically and show that it yields highly competitive experimental results in predicting patient mortality on real-world COVID-19 EHR data from over 7,000 patients admitted to a large, urban health system. Our method achieves an AUROC of 0.872, outperforming the alternative pre-training models and traditional machine learning methods. Additionally, our method performs much better when the training data size is small (345 training instances).
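The abstract does not spell out the sampling procedure, so the following is only a minimal sketch of what k-nearest-neighbor positive sampling could look like, under stated assumptions: positives for an anchor are its k closest patients in embedding space that share the anchor's outcome label, and closeness is squared Euclidean distance. The function name `knn_positive_samples` and its parameters are hypothetical and are not taken from the paper.

```python
import numpy as np

def knn_positive_samples(embeddings, labels, k):
    """Return, for each anchor, the indices of up to k nearest same-label samples.

    Assumption (not from the paper): squared Euclidean distance in embedding space.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)

    # Pairwise squared Euclidean distances, shape (n, n).
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diff, diff)

    # Only same-label samples other than the anchor itself are valid positives.
    same_label = labels[:, None] == labels[None, :]
    np.fill_diagonal(same_label, False)
    dist = np.where(same_label, dist, np.inf)

    # Take the k closest candidates per anchor, then drop invalid (infinite) ones,
    # which appear when a class has fewer than k other members.
    nearest = np.argsort(dist, axis=1)[:, :k]
    return [row[np.isfinite(dist[i, row])] for i, row in enumerate(nearest)]
```

Restricting positives to nearby same-label neighbors, rather than to every same-label sample, is one way to keep a contrastive objective from pulling together patients who share an outcome but belong to different sub-phenotypes, which is the intra-class variance problem the abstract describes.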