Abstract:

BACKGROUND In emergency departments (EDs), timely rescue is critical because patients’ conditions often deteriorate rapidly, and early diagnosis can increase their chances of survival. Predictive models that apply machine learning to Electronic Medical Record (EMR) data can support early diagnosis. However, ED data are usually imbalanced, contain missing values, and have sparse features. These quality issues make it challenging to build early identification models for diseases in the ED.

OBJECTIVE This study proposes a systematic approach to handling the missing-value, class-imbalance, and sparse-feature problems of ED data.

METHODS We used a random forest algorithm to impute missing values and the K-means algorithm to under-sample the data. For sparse features, we used principal component analysis (PCA) to reduce dimensionality. Imputation performance was evaluated with the coefficient of determination (R²) for continuous variables and the Kappa coefficient for discrete variables. Model performance was estimated with the area under the receiver operating characteristic curve (AUC) and the area under the precision-recall curve (AUPRC). To further evaluate the proposed approach, we carried out a case study using an ED dataset extracted from Hainan Hospital of Chinese PLA General Hospital. A logistic regression model for predicting worsening of patient condition was built from the data processed by the proposed approach.

RESULTS A total of 1085 patients with a rescue record and 17959 patients without a rescue record were collected, a highly imbalanced ratio of about 1:17. We extracted 275, 402, and 891 variables from laboratory tests, medications, and diagnoses, respectively. After data preprocessing, the median R² of random forest imputation for continuous variables was 0.623 (IQR: 0.647), and the median Kappa coefficient for discrete variable imputation was 0.444 (IQR: 0.285).
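The two imputation scores reported above can be illustrated with a small worked example. This is a minimal sketch in plain Python. It assumes the common protocol of hiding values that were actually observed, imputing them, and comparing imputed against true values; the abstract does not spell out the protocol. The function names and toy values are hypothetical.

```python
from collections import Counter


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot  # undefined if y_true is constant


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    # Observed agreement between true and imputed categories.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Agreement expected by chance from the marginal category frequencies.
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1.0 - p_e)  # undefined if p_e == 1


# Hypothetical held-out values: a continuous lab result and a binary flag.
lab_true = [1.0, 2.0, 3.0, 4.0]
lab_imputed = [1.5, 2.0, 2.5, 4.0]
flag_true = [1, 1, 0, 0]
flag_imputed = [1, 0, 0, 0]

r2 = r_squared(lab_true, lab_imputed)
kappa = cohen_kappa(flag_true, flag_imputed)
print(f"R2 = {r2:.3f}, kappa = {kappa:.3f}")  # R2 = 0.900, kappa = 0.500
```

Kappa corrects raw agreement for chance, which matters for sparse binary variables: an imputer that always predicts the majority category can reach high accuracy while its Kappa stays near zero.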
The logistic regression model built on the initial diagnostic data performed poorly and showed variable separation, reflected in abnormally high odds ratios (ORs) for two variables, cardiac arrest and respiratory arrest (27857.4 and 9341.6), and in abnormal confidence intervals. Using the processed data, the model reached a recall of 0.77, an F1-score of 0.74, and an AUC of 0.64.

CONCLUSIONS We proposed a machine learning method for handling data quality issues in emergency data, namely missing data, class imbalance, and sparse features, so as to improve data availability. A preliminary case study indicates that the data produced by the proposed method can be used to build prediction models for emergency patients.
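The preprocessing steps named in the abstract can be sketched end to end with scikit-learn. This is a sketch under stated assumptions, not the authors' implementation: the data are synthetic, only one column has missing values, and the under-sampling rule (one cluster per minority patient, keeping the majority sample nearest each centroid) is one common variant that the abstract does not confirm. All variable names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score,
                             pairwise_distances_argmin, roc_auc_score)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for ED data (assumption): 600 patients, 20 features,
# roughly 10% positives, and one feature with about 20% missing values.
n, d = 600, 20
X = rng.normal(size=(n, d))
X[:, 2] = 0.8 * X[:, 0] + rng.normal(scale=0.3, size=n)
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=n) > 1.9).astype(int)
missing = rng.random(n) < 0.2
X[missing, 2] = np.nan

# 1) Random forest imputation: predict the incomplete column from the others.
other = np.delete(X, 2, axis=1)
rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(other[~missing], X[~missing, 2])
X[missing, 2] = rf.predict(other[missing])

# Hold out a test set before rebalancing so evaluation uses the real class mix.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# 2) K-means under-sampling: cluster the majority class into as many clusters
#    as there are minority samples, then keep the sample nearest each centroid.
maj, mino = X_tr[y_tr == 0], X_tr[y_tr == 1]
km = KMeans(n_clusters=len(mino), n_init=10, random_state=0).fit(maj)
keep = np.unique(pairwise_distances_argmin(km.cluster_centers_, maj))
X_bal = np.vstack([maj[keep], mino])
y_bal = np.concatenate([np.zeros(len(keep), dtype=int),
                        np.ones(len(mino), dtype=int)])

# 3) PCA: keep enough components to explain 95% of the variance.
pca = PCA(n_components=0.95).fit(X_bal)

# 4) Logistic regression on the reduced, balanced training data.
clf = LogisticRegression(max_iter=1000).fit(pca.transform(X_bal), y_bal)
scores = clf.predict_proba(pca.transform(X_te))[:, 1]

auc = roc_auc_score(y_te, scores)
auprc = average_precision_score(y_te, scores)
print(f"AUC = {auc:.2f}, AUPRC = {auprc:.2f}")
```

Fitting the K-means and PCA steps on the training split only keeps the held-out patients out of preprocessing, so the reported AUC and AUPRC reflect the original class imbalance.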

Dealing with the Missing, Imbalanced and Sparse Features Problems in Emergency Data Using Random Forest, K-means and PCA Respectively (Preprint)
