Abstract: Learning models struggle to achieve high classification performance on imbalanced data sets: when one class is much larger than the others, most machine learning and data mining classifiers are overly influenced by the larger classes and ignore the smaller ones. As a result, classification algorithms often learn poorly because of slow convergence on the smaller classes. To balance such data sets, this paper presents a strategy that reduces the size of the majority data and generates synthetic samples for the minority data. In the reduction step, we use the box-and-whisker plot approach to exclude outliers and the Mega-Trend-Diffusion method to find representative data within the majority data. To generate the synthetic samples, we propose a counterintuitive hypothesis to find the distribution shape of the minority data, and then produce samples according to this distribution. Four real data sets were used to examine the performance of the proposed approach. We used paired t-tests to compare the Accuracy, G-mean, and F-measure scores of the proposed data pre-processing (PPDP) method combined with the D3C method (PPDP+D3C) against those of one-sided selection (OSS), the well-known SMOTEBoost (SB) method, the normal distribution-based oversampling (NDO) approach, and the PPDP method alone. The results indicate that the proposed approach achieves better classification performance than the above-mentioned methods.
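To make two of the ingredients named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the conventional box-and-whisker rule (points beyond 1.5 times the interquartile range from the quartiles count as outliers) and the standard binary definitions of G-mean (the geometric mean of sensitivity and specificity) and F-measure (the harmonic mean of precision and recall). The function names `iqr_inlier_mask` and `g_mean_and_f_measure` and the whisker factor `k` are hypothetical choices made for illustration; the paper may use different settings.

```python
import numpy as np


def iqr_inlier_mask(X, k=1.5):
    """Boolean mask of rows whose every feature lies inside the whiskers.

    Assumption: the usual box-plot rule, [Q1 - k*IQR, Q3 + k*IQR] per feature.
    """
    X = np.asarray(X, dtype=float)
    q1 = np.percentile(X, 25, axis=0)
    q3 = np.percentile(X, 75, axis=0)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return np.all((X >= lower) & (X <= upper), axis=1)


def g_mean_and_f_measure(y_true, y_pred, positive=1):
    """G-mean and F-measure for a binary task; `positive` marks the minority class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == positive) & (y_pred == positive))
    fn = np.sum((y_true == positive) & (y_pred != positive))
    tn = np.sum((y_true != positive) & (y_pred != positive))
    fp = np.sum((y_true != positive) & (y_pred == positive))
    # Guard each ratio against an empty denominator.
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    g_mean = float(np.sqrt(sensitivity * specificity))
    denom = precision + sensitivity
    f_measure = float(2 * precision * sensitivity / denom) if denom else 0.0
    return g_mean, f_measure


# Toy majority-class feature column with one extreme value.
majority = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
mask = iqr_inlier_mask(majority)
filtered = majority[mask]

# Toy predictions: class 1 is the minority class.
g_mean, f_measure = g_mean_and_f_measure([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1])
```

Unlike plain accuracy, both scores fall sharply when the minority class is misclassified, which is presumably why the abstract reports them alongside Accuracy.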

Effective Sample Synthesizing in Kernel Space for Imbalanced Classification

Imbalanced Data Classification Algorithm Based on Integrated Sampling and Ensemble Learning

A Classfication Method For Imbalance Data Set Based on Kernel SMOTE

Variational autoencoder based synthetic data generation for imbalanced learning

Sample Weighting: an Inherent Approach for Outlier Suppressing Discriminant Analysis

A Synthetic Minority Oversampling Method Based on Local Densities in Low-Dimensional Space for Imbalanced Learning

Synthetic Information towards Maximum Posterior Ratio for deep learning on Imbalanced Data

Weakly Supervised-Based Oversampling for High Imbalance and High Dimensionality Data Classification

Confronting Discrimination in Classification: Smote Based on Marginalized Minorities in the Kernel Space for Imbalanced Data

Improving SVM Classification with Imbalance Data Set

Improved Oversampling Algorithm for Imbalanced Data Based on K-Nearest Neighbor and Interpolation Process Optimization

GenSample: A Genetic Algorithm for Oversampling in Imbalanced Datasets

Improved SVM algorithm for imbalanced dataset classification

An Improved Algorithm for Imbalanced Data and Small Sample Size Classification

Detecting representative data and generating synthetic samples to improve learning accuracy with imbalanced data sets

Synthetic Over-sampling with the Minority and Majority Classes for Imbalance Problems

Noise-free sampling with majority framework for an imbalanced classification problem

A Normal Distribution-Based Over-Sampling Approach to Imbalanced Data Classification

Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance

Hybrid SVM algorithm oriented to classifying imbalanced datasets