Abstract: The problems of imbalanced datasets are generally considered asymmetric issues. In asymmetric problems, artificial intelligence models may exhibit different biases or preferences when dealing with different classes. In class imbalance learning, the classification model pays too much attention to the majority class samples and cannot guarantee classification performance on the minority class samples, which might be more valuable. Unbalanced datasets can be optimized by synthesizing minority class samples and thereby changing the data distribution. Traditional oversampling algorithms suffer from blindness and boundary ambiguity when synthesizing new samples. A modified reclassification algorithm based on Gaussian distribution is put forward. First, the minority class samples are reclassified by the KNN algorithm. Then, different synthesis strategies are selected according to the combination of the minority class samples, and under certain classification conditions the Gaussian distribution replaces the uniform random distribution in the interpolation operation to reduce the possibility of generating noise samples. The experimental results indicate that the proposed oversampling algorithm can achieve a performance improvement of 2∼8% in evaluation metrics, including G-mean, F-measure, and AUC, compared to traditional oversampling algorithms.
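
The replacement of the uniform random distribution by a Gaussian one can be illustrated with a minimal sketch. It contrasts a classic SMOTE-style interpolation step with a variant whose interpolation coefficient is drawn from a normal distribution. The function names, the values `mu=0.25` and `sigma=0.1`, and the clipping to [0, 1] are assumptions made for illustration; the abstract does not state the paper's actual parameters.

```python
import random

def uniform_interpolate(x, neighbor, rng):
    """SMOTE-style step: pick a point uniformly on the segment x -> neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]

def gaussian_interpolate(x, neighbor, rng, mu=0.25, sigma=0.1):
    """Same segment, but the gap is drawn from N(mu, sigma^2), clipped to [0, 1].

    mu and sigma are illustrative values, not taken from the paper.
    A small mu keeps the synthetic point close to x.
    """
    gap = min(1.0, max(0.0, rng.gauss(mu, sigma)))
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]

rng = random.Random(0)
x, neighbor = [0.0, 0.0], [1.0, 1.0]

uniform_pts = [uniform_interpolate(x, neighbor, rng) for _ in range(1000)]
gauss_pts = [gaussian_interpolate(x, neighbor, rng) for _ in range(1000)]

# Average position along the segment (0 = at x, 1 = at the neighbor).
mean_uniform = sum(p[0] for p in uniform_pts) / len(uniform_pts)
mean_gauss = sum(p[0] for p in gauss_pts) / len(gauss_pts)
print(f"uniform mean gap: {mean_uniform:.3f}, gaussian mean gap: {mean_gauss:.3f}")
```

With these illustrative settings the Gaussian variant concentrates synthetic points near the seed sample instead of spreading them evenly along the segment, which is the intuition behind reducing noise samples.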

What problem does this paper attempt to address?

The problem that this paper attempts to solve is the blindness and boundary ambiguity of traditional over-sampling algorithms when they synthesize new samples for imbalanced data sets. Specifically:

1. **Blindness**: When generating new minority-class samples, traditional over-sampling algorithms do not fully consider the distribution characteristics of minority-class samples and may generate noise samples. These samples provide no useful information and may also interfere with the performance of the classifier.
2. **Boundary ambiguity**: Traditional over-sampling algorithms ignore the boundary characteristics of the sample set, so too many new samples are generated in areas far from the class boundaries. This may significantly affect the performance of the classifier and blur the boundary between the two classes.

To solve these problems, the paper proposes GI-SMOTE (Gaussian Interpolation SMOTE), an improved over-sampling algorithm based on K-nearest neighbors (KNN) and Gaussian distribution interpolation optimization. The algorithm addresses the deficiencies of traditional over-sampling algorithms in two stages:

1. **Classification stage**:
   - Use the KNN algorithm to re-classify the minority-class samples into three categories, "noise", "dangerous", and "safe", according to the neighborhood of each minority-class sample in the overall data set.
   - "Noise" samples are completely surrounded by majority-class samples and are considered invalid; "safe" samples are mainly surrounded by minority-class samples and are considered valid; "dangerous" samples are surrounded by both minority-class and majority-class samples and require special treatment.
2. **Data synthesis stage**:
   - "Noise" samples are not used; only "dangerous" and "safe" samples take part in the interpolation operations.
   - For "dangerous" samples, the synthesis strategy is selected according to the type of the nearest neighbor sample:
     - If the nearest neighbor is a "safe" sample, a Gaussian distribution is used to generate new samples, emphasizing the attribute information of the "dangerous" sample.
     - If the nearest neighbor is also a "dangerous" sample, two Gaussian distributions with different means are used to generate new samples, reducing the interference of the generated samples with classification performance.

Through these improvements, the GI-SMOTE algorithm can preserve the characteristics of minority-class samples more accurately when generating new samples, while avoiding noise samples and blurred class boundaries. Experimental results show that, compared with the traditional SMOTE, Borderline-SMOTE, and ADASYN algorithms, GI-SMOTE achieves better classification performance on multiple imbalanced data sets, especially on evaluation indicators such as G-mean, F-measure, and AUC.
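
The classification stage described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the function name `label_minority_samples`, the default `k`, the toy data, and the exact thresholds are assumptions. The thresholds follow the Borderline-SMOTE convention (all neighbors majority means "noise", at least half majority means "dangerous", otherwise "safe"), which is one plausible reading of "mainly surrounded by".

```python
import math

def label_minority_samples(X, y, minority_label=1, k=5):
    """Label each minority sample as 'noise', 'dangerous', or 'safe'.

    Thresholds are an assumption borrowed from Borderline-SMOTE:
    all k neighbors majority -> noise; at least half -> dangerous; else safe.
    Returns a dict mapping sample index -> label.
    """
    labels = {}
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority_label:
            continue
        # Brute-force k nearest neighbors over the whole data set,
        # excluding the sample itself.
        neighbors = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )[:k]
        n_majority = sum(1 for j in neighbors if y[j] != minority_label)
        if n_majority == len(neighbors):
            labels[i] = "noise"
        elif 2 * n_majority >= len(neighbors):
            labels[i] = "dangerous"
        else:
            labels[i] = "safe"
    return labels

# Hypothetical toy data: a compact minority cluster, one borderline
# minority point, and one minority point isolated among majority points.
X = [(0, 0), (1, 0), (0, 1), (1, 1),   # minority cluster
     (3, 0),                            # minority point near majority points
     (10, 10),                          # minority point inside majority region
     (4, 0), (3, 1),                    # majority points near the border
     (10, 11), (11, 10), (9, 10)]       # majority points around (10, 10)
y = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

labels = label_minority_samples(X, y, minority_label=1, k=3)
print(labels)
```

In the synthesis stage, the indices labeled "noise" would be dropped, and only the "dangerous" and "safe" indices would serve as seeds for interpolation.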

Improved Oversampling Algorithm for Imbalanced Data Based on K-Nearest Neighbor and Interpolation Process Optimization

Imbalanced Data Sets Classification Method Based on Over-Sampling Technique

Imbalanced Data Classification Algorithm Based on Integrated Sampling and Ensemble Learning.

Gaussian Distribution Based Oversampling for Imbalanced Data Classification

An Improving Majority Weighted Minority Oversampling Technique for Imbalanced Classification Problem

An Improved MAHAKIL Oversampling Method for Imbalanced Dataset Classification

A Novel Adaptive Minority Oversampling Technique for Improved Classification in Data Imbalanced Scenarios

Improved SVM algorithm for imbalanced dataset classification

Hybrid SVM algorithm oriented to classifying imbalanced datasets

A Diversity-Based Synthetic Oversampling Using Clustering for Handling Extreme Imbalance

Adaptively weighted three-way decision oversampling: A cluster imbalanced-ratio based approach

A hybrid sampling method for highly imbalanced and overlapped data classification with complex distribution

Over-sampling Algorithm Based on Preliminary Classification in Imbalanced Data Sets Learning

Oversampling With Reliably Expanding Minority Class Regions for Imbalanced Data Learning

Natural local density-based adaptive oversampling algorithm for imbalanced classification

A Normal Distribution-Based Over-Sampling Approach to Imbalanced Data Classification

An oversampling FCM-KSMOTE algorithm for imbalanced data classification

A Classfication Method For Imbalance Data Set Based on Kernel SMOTE

An Over Sampling Method of Unbalanced Data Based on Ant Colony Clustering

Over-sampling algorithm for imbalanced data classification