Abstract: Class imbalance and distributional differences in large datasets present significant challenges for classification tasks in machine learning, often leading to biased models and poor predictive performance for minority classes. This work introduces two novel undersampling approaches: mutual information-based stratified simple random sampling and support points optimization. These methods prioritize representative data selection, effectively minimizing information loss. Empirical results across multiple classification tasks demonstrate that our methods outperform traditional undersampling techniques, achieving higher balanced classification accuracy. These findings highlight the potential of combining statistical concepts with machine learning to address class imbalance in practical applications.

What problem does this paper attempt to address?

### What problem does this paper attempt to solve?

This paper addresses class imbalance in machine learning. When some classes in a dataset have far more samples than others, models tend to identify minority-class instances poorly, which degrades overall classification performance. The imbalance matters most in applications such as fraud detection, medical diagnosis, and fault detection, where rare abnormal cases are the ones that must be identified accurately.

The traditional remedy is undersampling, which reduces the number of majority-class samples, but it has two limitations:

- **Information loss**: Removing majority-class samples may discard important information.
- **Difficulty in handling class overlap near the decision boundary**: Undersampling may cause the model to perform poorly near the decision boundary.

The paper therefore proposes two novel undersampling methods:

1. **Mutual Information-based Stratified Simple Random Sampling (MI-SRS)**:
   - Use mutual information to stratify the data so that the data points in each stratum are representative.
   - Then perform simple random sampling within each stratum to preserve the distribution characteristics of the original data.
2. **Support Points Optimization (SPO)**:
   - Use support points to select a representative subset whose statistical characteristics are as close as possible to those of the original data.
   - Support points achieve this by minimizing the energy distance, so the selected subset faithfully reflects the distribution of the original data.

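The MI-SRS idea described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the summary does not say how strata are built, so the sketch assumes they are quantile bins of the single feature with the highest estimated mutual information with the label, with proportional allocation across strata. The helper names `binned_mutual_info` and `mi_stratified_undersample`, the histogram-based MI estimator, and all parameter defaults are hypothetical.

```python
import numpy as np


def binned_mutual_info(x, y, bins=10):
    """Plug-in estimate (in nats) of the MI between one feature and the labels."""
    # Discretize the feature into quantile bins (an illustrative choice).
    edges = np.quantile(x, np.linspace(0, 1, bins + 1)[1:-1])
    xb = np.digitize(x, edges)
    _, yi = np.unique(y, return_inverse=True)
    # Joint frequency table of (feature bin, class).
    joint = np.zeros((bins, yi.max() + 1))
    np.add.at(joint, (xb, yi), 1.0)
    joint /= joint.sum()
    # Product of the marginals, same shape as the joint table.
    outer = joint.sum(axis=1, keepdims=True) @ joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / outer[nz])))


def mi_stratified_undersample(X, y, majority_label, n_keep, n_strata=5, seed=0):
    """Keep roughly n_keep majority rows, sampled within MI-derived strata."""
    rng = np.random.default_rng(seed)
    # Assumption: stratify on the feature with the highest MI with the label.
    mi = [binned_mutual_info(X[:, j], y) for j in range(X.shape[1])]
    top = int(np.argmax(mi))

    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)

    # Quantile strata of the chosen feature, computed on majority rows only.
    values = X[maj_idx, top]
    edges = np.quantile(values, np.linspace(0, 1, n_strata + 1)[1:-1])
    strata = np.digitize(values, edges)

    kept = [min_idx]  # every minority row is retained
    for s in np.unique(strata):
        members = maj_idx[strata == s]
        # Proportional allocation; rounding means the total is only close to n_keep.
        k = max(1, round(n_keep * len(members) / len(maj_idx)))
        kept.append(rng.choice(members, size=min(k, len(members)), replace=False))

    keep = np.sort(np.concatenate(kept))
    return X[keep], y[keep]


# Synthetic data: about 10% minority, with feature 0 shifted for the minority class.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.1).astype(int)
X[y == 1, 0] += 2.0
X_bal, y_bal = mi_stratified_undersample(
    X, y, majority_label=0, n_keep=int((y == 1).sum())
)
```

Sampling within strata rather than from the whole majority class keeps each region of the informative feature represented in the reduced set, which is the property the summary attributes to MI-SRS.
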
The main goals of both methods are to optimize data selection, minimize information loss, improve classification accuracy on minority classes, and remain computationally efficient.

### Summary

The research question can be summarized as: **Can undersampling methods based on mutual information combined with stratified simple random sampling, or on support points, address class imbalance more effectively than other techniques and improve model performance on classification tasks?** By drawing on two statistical concepts, mutual information and support points, the paper aims to remedy the deficiencies of existing undersampling methods and provide better solutions for class imbalance in practical applications.
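The energy-distance criterion behind support points can be made concrete with a small sketch. It computes the sample energy distance, 2·E‖X−Y‖ − E‖X−X′‖ − E‖Y−Y′‖, between a candidate subset and the full majority sample. The selection step is deliberately simplified: instead of the optimization used for true support points, it picks the best of several random subsets, a crude stand-in chosen only for illustration. The function names and parameters are hypothetical.

```python
import numpy as np


def mean_pairwise_distance(a, b):
    """Mean Euclidean distance over all pairs (row of a, row of b)."""
    diff = a[:, None, :] - b[None, :, :]
    return float(np.sqrt((diff ** 2).sum(axis=-1)).mean())


def energy_distance(subset, full):
    """Sample energy distance between the empirical distributions of two sets."""
    return (2.0 * mean_pairwise_distance(subset, full)
            - mean_pairwise_distance(subset, subset)
            - mean_pairwise_distance(full, full))


def best_random_subset(full, n, trials=20, seed=0):
    """Stand-in for support points: lowest-energy subset among random draws."""
    rng = np.random.default_rng(seed)
    best_idx, best_val = None, np.inf
    for _ in range(trials):
        idx = rng.choice(len(full), size=n, replace=False)
        val = energy_distance(full[idx], full)
        if val < best_val:
            best_idx, best_val = idx, val
    return best_idx, best_val


# Synthetic majority class; compare a selected subset with a deliberately biased one.
rng = np.random.default_rng(1)
majority = rng.normal(size=(400, 2))
idx, val = best_random_subset(majority, n=40)
biased = majority[np.argsort(majority[:, 0])[:40]]  # lowest 10% on feature 0
biased_val = energy_distance(biased, majority)
```

A subset drawn from only one region of the data (`biased`) sits much farther from the full sample in energy distance than a well-spread subset, which is why minimizing this quantity favors representative selections.
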

Statistical Undersampling with Mutual Information and Support Points

Imbalanced Data Sets Classification Method Based on Over-Sampling Technique

Efficient hybrid oversampling and intelligent undersampling for imbalanced big data classification

Weakly Supervised-Based Oversampling for High Imbalance and High Dimensionality Data Classification

A Novel Adaptive Minority Oversampling Technique for Improved Classification in Data Imbalanced Scenarios

Trainable Undersampling for Class-Imbalance Learning.

A Density-based Under-sampling Algorithm for Imbalance Classification

A Classfication Method For Imbalance Data Set Based on Kernel SMOTE

A Normal Distribution-Based Over-Sampling Approach to Imbalanced Data Classification

Entropy-based Sampling Approaches for Multi-Class Imbalanced Problems

Minimum Enclosing Ball Synthetic Minority Oversampling Technique from a Geometric Perspective

A Synthetic Minority Oversampling Method Based on Local Densities in Low-Dimensional Space for Imbalanced Learning.

Towards Deeper Insights into Deep Learning from Imbalanced Data.

Resampling approach for imbalanced data classification based on class instance density per feature value intervals

Similarity Majority Under-Sampling Technique for Easing Imbalanced Classification Problem

Under-sampling class imbalanced datasets by combining clustering analysis and instance selection

Restoring balance: principled under/oversampling of data for optimal classification

Synthetic oversampling with Mahalanobis distance and local information for highly imbalanced class-overlapped data

A majority affiliation based under-sampling method for class imbalance problem

Undersampling is a Minimax Optimal Robustness Intervention in Nonparametric Classification