Abstract:

Background: Data discretization is an important preprocessing step in data mining that converts continuous feature values into discrete ones. It allows certain data mining algorithms to construct more effective models and facilitates the data mining process. Because many medical domain datasets are class imbalanced, data resampling methods, including oversampling, undersampling, and hybrid sampling, have been widely applied to rebalance the training set so that majority and minority classes can be differentiated effectively.

Objective: We examine how incorporating both data discretization and data resampling into the analytical process affects classifier performance on class-imbalanced medical datasets. The experiments also compare the order in which these two steps are carried out.

Methods: Two experimental studies were conducted, one based on 11 two-class imbalanced medical datasets and the other on 3 multiclass imbalanced medical datasets. The two discretization algorithms employed are ChiMerge and the minimum description length principle (MDLP). The data resampling algorithms chosen for comparison are Tomek links undersampling, synthetic minority oversampling technique (SMOTE) oversampling, and SMOTE–Tomek hybrid sampling. The support vector machine (SVM), C4.5 decision tree, and random forest (RF) classifiers were used to examine the classification performance of the different approaches.

Results: On average, the combination approaches allow the classifiers to achieve area under the ROC curve (AUC) values that exceed the best baseline approach by approximately 0.8%–3.5% for two-class and 0.9%–2.5% for multiclass imbalanced medical datasets. In particular, the best results for two-class imbalanced datasets are obtained by performing MDLP discretization first and SMOTE oversampling second, which provides the highest AUC and requires the least computational cost. For multiclass imbalanced datasets, performing SMOTE or SMOTE–Tomek resampling first and ChiMerge discretization second offers the best performance.

Conclusions: Classifiers with oversampling can perform better than the baseline method without oversampling. In contrast, performing data discretization does not necessarily make the classifiers outperform the baselines. On average, the combination approaches have the potential to allow the classifiers to achieve higher AUC values than the best baseline approach.
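The two step orders compared in the study can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation. Equal-width binning stands in for MDLP and ChiMerge, and `smote_like` is a simplified two-class interpolation in the spirit of SMOTE. All function names, parameters, and the toy data are hypothetical.

```python
import random

def equal_width_bins(column, n_bins=3):
    """Map continuous values to bin codes 0..n_bins-1.
    Assumption: a simple stand-in for MDLP or ChiMerge."""
    lo, hi = min(column), max(column)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant columns
    return [min(int((v - lo) / width), n_bins - 1) for v in column]

def discretize(X, n_bins=3):
    """Discretize every feature column of X independently."""
    cols = [equal_width_bins(list(col), n_bins) for col in zip(*X)]
    return [list(row) for row in zip(*cols)]

def smote_like(X, y, minority, k=2, seed=0):
    """Simplified two-class oversampling: add synthetic minority points by
    interpolating between a minority point and one of its k nearest
    minority neighbours until both classes have the same size.
    Assumes at least two minority samples."""
    rng = random.Random(seed)
    pts = [x for x, label in zip(X, y) if label == minority]
    n_needed = (len(y) - len(pts)) - len(pts)
    X_new, y_new = [list(x) for x in X], list(y)
    for _ in range(max(n_needed, 0)):
        base = rng.choice(pts)
        neighbours = sorted(
            (p for p in pts if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()
        X_new.append([a + gap * (b - a) for a, b in zip(base, other)])
        y_new.append(minority)
    return X_new, y_new

# Toy two-class data: six majority samples (0), two minority samples (1).
X = [[0.1, 5.0], [0.2, 4.8], [0.3, 5.2], [0.4, 4.9],
     [0.5, 5.1], [0.6, 5.3], [2.0, 1.0], [2.2, 1.2]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

# Order A: discretize first, then oversample.
# Interpolated points may in general fall between bin codes.
Xa, ya = smote_like(discretize(X), y, minority=1)

# Order B: oversample first, then discretize the rebalanced set.
Xb, yb = smote_like(X, y, minority=1)
Xb = discretize(Xb)
```

Either rebalanced set could then be passed to a classifier such as SVM, a decision tree, or random forest. In a real evaluation, resampling would normally be applied to the training split only.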

A cluster-based oversampling algorithm combining SMOTE and k-means for imbalanced medical data

Imbalanced Data Sets Classification Method Based on Over-Sampling Technique

A Hybrid Sampling Algorithm Combining M-SMOTE and ENN Based on Random Forest for Medical Imbalanced Data

A self-inspected adaptive SMOTE algorithm (SASMOTE) for highly imbalanced data classification in healthcare

Distribution-Sensitive Unbalanced Data Oversampling Method for Medical Diagnosis

An oversampling FCM-KSMOTE algorithm for imbalanced data classification

Oversampling for Imbalanced Learning Based on K-Means and SMOTE

An Over Sampling Method of Unbalanced Data Based on Ant Colony Clustering

Interaction effect between data discretization and data resampling for class-imbalanced medical datasets

A Synthetic Minority Oversampling Technique Based on Gaussian Mixture Model Filtering for Imbalanced Data Classification

A New Over-Sampling Technique Based On Svm For Imbalanced Diseases Data

DDSC-SMOTE: an imbalanced data oversampling algorithm based on data distribution and spectral clustering

Over-sampling algorithm for imbalanced data classification

Adaptive K-means Clustering Based Under-Sampling Methods to Solve the Class Imbalance Problem

Application of resampling technique in the classification of imbalanced diabetes data in middle-aged and elderly residents

Clustering-based Undersampling with Random over Sampling Examples and Support Vector Machine for Imbalanced Classification of Breast Cancer Diagnosis

Breast Cancer Diagnosis Using Cluster-based Undersampling and Boosted C5.0 Algorithm

A Particle Swarm Based Hybrid System for Imbalanced Medical Data Sampling

Under-sampling Method Research in Class-Imbalanced Data

Agnes-Smote: An Oversampling Algorithm Based On Hierarchical Clustering And Improved Smote

A cluster-based SMOTE both-sampling (CSBBoost) ensemble algorithm for classifying imbalanced data