Handling the Missing Data Problem in Electronic Health Records for Cancer Prediction.

Xudong Zhang, Jiehao Xiao, Yifei Gong, Ning Yu, Wei Zhang, Sunghoon Jang, Feng Gu
DOI: https://doi.org/10.22360/springsim.2020.msm.006
2020-01-01
Abstract: Electronic health records (EHRs) contain patients’ clinical information. EHRs are widely used in disease diagnosis and therapy because of the volume and value of the medical information they hold. However, missing data in EHRs hinders their use. A common imputation approach replaces missing values with the feature mean, but this weakens feature importance. In this study, we use the expectation-maximization (EM) algorithm to impute the missing data in EHRs. Several machine learning models, namely an artificial neural network, logistic regression, a support vector machine, and random forests, are used to evaluate the effectiveness of the imputation. The experimental results show that these models predict cancer more accurately on EHRs imputed with the EM algorithm than on EHRs imputed with mean values, which indicates that the EM algorithm provides accurate estimates for imputing EHR data.
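The abstract contrasts mean imputation with EM imputation but does not spell out the EM formulation. The following is a minimal sketch of one common variant, assuming the features are numeric and roughly follow a multivariate Gaussian; it is not taken from the paper. The function names `mean_impute` and `em_impute`, and the parameters `n_iter`, `tol`, and `ridge`, are hypothetical choices made for illustration.

```python
import numpy as np


def mean_impute(X):
    """Baseline: replace each NaN with the mean of its column."""
    X = np.asarray(X, dtype=float).copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X


def em_impute(X, n_iter=50, tol=1e-6, ridge=1e-6):
    """EM imputation under an assumed multivariate Gaussian model.

    E-step: fill each missing block with its conditional mean given the
    observed entries of the same row. M-step: re-estimate the mean and
    covariance, adding the conditional covariance of the filled entries.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    n, d = X.shape

    # Start from the mean-imputed matrix.
    filled = mean_impute(X)
    mu = filled.mean(axis=0)
    sigma = np.cov(filled, rowvar=False, bias=True) + ridge * np.eye(d)

    for _ in range(n_iter):
        prev = filled.copy()
        cond_cov_sum = np.zeros((d, d))

        for i in range(n):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():
                # Nothing observed in this row: fall back to the mean.
                filled[i] = mu
                cond_cov_sum += sigma
                continue
            s_oo = sigma[np.ix_(o, o)]
            s_mo = sigma[np.ix_(m, o)]
            # coef = s_mo @ inv(s_oo), computed without an explicit inverse.
            coef = np.linalg.solve(s_oo, s_mo.T).T
            filled[i, m] = mu[m] + coef @ (X[i, o] - mu[o])
            cond_cov_sum[np.ix_(m, m)] += sigma[np.ix_(m, m)] - coef @ s_mo.T

        # M-step: update the Gaussian parameters from the completed data.
        mu = filled.mean(axis=0)
        centered = filled - mu
        sigma = (centered.T @ centered + cond_cov_sum) / n + ridge * np.eye(d)

        if np.max(np.abs(filled - prev)) < tol:
            break

    return filled
```

Unlike mean imputation, which assigns the same value to every missing entry in a column, this sketch uses the correlations between features to estimate each missing entry from the values observed in the same record. Categorical EHR fields would need to be encoded or handled separately before applying it.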