Abstract: Background: Outliers and class imbalance in medical data can reduce the accuracy of machine learning models. Physicians who want to apply predictive models face two difficult questions: how to build a model from the data at hand, and which model to choose. Modeling therefore needs to account for outliers, imbalanced data, model selection, and parameter tuning.

Methods: This study used a joint modeling strategy consisting of four steps: outlier detection and removal, data balancing, model fitting and prediction, and performance evaluation. We collected medical record data for all patients with intracerebral hemorrhage (ICH) admitted in 2017-2019 from Sichuan Province. Clinical and radiological variables were used to construct models predicting mortality 90 days after discharge. We used stacking ensemble learning to combine logistic regression (LR), random forest (RF), artificial neural network (ANN), support vector machine (SVM), and k-nearest neighbors (KNN) models. Accuracy, sensitivity, specificity, AUC, precision, and F1 score were used to evaluate model performance. Finally, we compared all 84 combinations of the joint modeling strategy: a training set with or without the cross-validated committees filter (CVCF); five resampling techniques (random under-sampling (RUS), random over-sampling (ROS), adaptive synthetic sampling (ADASYN), borderline synthetic minority oversampling technique (Borderline SMOTE), and synthetic minority oversampling technique with edited nearest neighbor (SMOTEENN)) or no resampling; and seven models (LR, RF, ANN, SVM, KNN, Stacking, AdaBoost).

Results: Among 4207 patients with ICH, 2909 (69.15%) survived 90 days after discharge and 1298 (30.85%) died within 90 days after discharge. Removing outliers with CVCF improved the performance of all models on every metric except sensitivity. For data balancing, training sets without resampling outperformed training sets with resampling in accuracy, specificity, and precision, while ROS gave the best AUC. Across the seven models, RF had the highest average accuracy, specificity, AUC, and precision, and Stacking performed best on F1 score. Among all 84 combinations of the joint modeling strategy, eight tied for the best accuracy (0.816). The best sensitivity was achieved by SMOTEENN + Stacking (0.662), the best specificity by CVCF + KNN (0.987), and the best precision by CVCF + SVM (0.938). Stacking and AdaBoost had the best AUC (0.756) and F1 score (0.602), respectively.

Conclusion: This study proposed a joint modeling strategy comprising outlier detection and removal, data balancing, model fitting and prediction, and performance evaluation, as a reference for physicians and researchers who want to build their own models. It illustrated the importance of outlier detection and removal for machine learning and showed that ensemble learning might be a good modeling strategy. Because the imbalance ratio (IR, the ratio of majority-class to minority-class samples) in this study was low, resampling did not improve accuracy, specificity, or precision, although ROS performed best on AUC.
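The arithmetic behind the strategy grid and the imbalance ratio can be checked with a short Python sketch. It is illustrative only and is not the authors' code: the variable names and the `binary_metrics` helper are assumptions introduced here, and only the option labels and patient counts come from the abstract. AUC is left out because it requires predicted scores rather than confusion-matrix counts.

```python
from itertools import product

# Options named in the abstract; "none" marks the step being skipped.
outlier_steps = ["none", "CVCF"]
resamplers = ["none", "RUS", "ROS", "ADASYN", "Borderline SMOTE", "SMOTEENN"]
models = ["LR", "RF", "ANN", "SVM", "KNN", "Stacking", "AdaBoost"]

# Every (outlier step, resampler, model) triple is one joint modeling strategy.
combinations = list(product(outlier_steps, resamplers, models))
print(len(combinations))  # 2 * 6 * 7 = 84

# Outcome counts reported in the abstract.
survived, died = 2909, 1298
total = survived + died

# Imbalance ratio: majority-class count divided by minority-class count.
imbalance_ratio = survived / died
print(f"IR = {imbalance_ratio:.2f}")  # IR = 2.24


def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Hypothetical helper: count-based metrics for a binary classifier.

    tp, fp, tn, fn are confusion-matrix counts with death as the positive class.
    """
    sensitivity = _ratio(tp, tp + fn)   # recall on the positive class
    specificity = _ratio(tn, tn + fp)   # recall on the negative class
    precision = _ratio(tp, tp + fp)
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": _ratio(2 * precision * sensitivity, precision + sensitivity),
    }
```

An IR of about 2.24 is modest, which is consistent with the abstract's observation that resampling did not improve accuracy, specificity, or precision on this dataset.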

Tree-Guided Rare Feature Selection and Logic Aggregation with Electronic Health Records Data

Applying Machine Learning Approaches to Suicide Prediction Using Healthcare Data: Overview and Future Directions

Using Electronic Health Records and Machine Learning to Predict Postpartum Depression.

Suicide Death Predictive Models using Electronic Health Record Data

Automated feature selection of predictors in electronic medical records data

A tree-based gene-environment interaction analysis with rare features

Revealing Suicide Risk of Young Adults Based on Comprehensive Measurements Using Decision Tree Classification

Enhancing Suicide Risk Prediction Models with Temporal Clinical Note Features

Target-based fusion using social determinants of health to enhance suicide prediction with electronic health records

Machine learning for suicide risk prediction in children and adolescents with electronic health records

Interrelated feature selection from health surveys using domain knowledge graph

SCOPE: predicting future diagnoses in office visits using electronic health records

Stabilized Sparse Ordinal Regression for Medical Risk Stratification

Acute coronary syndrome risk prediction based on gradient boosted tree feature selection and recursive feature elimination: A dataset-specific modeling study

Enabling scalable clinical interpretation of ML-based phenotypes using real world data

Joint modeling strategy for using electronic medical records data to build machine learning models: an example of intracerebral hemorrhage

Deep Sequential Models for Suicidal Ideation from Multiple Source Data

Dealing with the Missing, Imbalanced and Sparse Features Problems in Emergency Data Using Random Forest, K-means and PCA Respectively (Preprint)

Supervised Extraction of Diagnosis Codes from EMRs: Role of Feature Selection, Data Selection, and Probabilistic Thresholding

Unsupervised Machine Learning for the Discovery of Latent Disease Clusters and Patient Subgroups Using Electronic Health Records

Rare Codes Count: Mining Inter-code Relations for Long-tail Clinical Text Classification