Statistical Undersampling with Mutual Information and Support Points

Alex Mak, Shubham Sahoo, Shivani Pandey, Yidan Yue, Linglong Kong
2024-12-19
Abstract: Class imbalance and distributional differences in large datasets present significant challenges for classification tasks in machine learning, often leading to biased models and poor predictive performance for minority classes. This work introduces two novel undersampling approaches: mutual information-based stratified simple random sampling and support points optimization. These methods prioritize representative data selection, effectively minimizing information loss. Empirical results across multiple classification tasks demonstrate that our methods outperform traditional undersampling techniques, achieving higher balanced classification accuracy. These findings highlight the potential of combining statistical concepts with machine learning to address class imbalance in practical applications.
Machine Learning
### What problem does this paper attempt to solve?

This paper aims to solve the problem of class imbalance in machine learning. Specifically, when some classes in a dataset have far more samples than others, models tend to perform poorly at identifying minority-class instances, which in turn degrades the overall performance of classification tasks. This imbalance matters most in applications such as fraud detection, medical diagnosis, and fault detection, because these fields usually need to accurately identify rare abnormal situations.

The traditional remedy is undersampling, which reduces the number of majority-class samples, but it has the following limitations:

- **Information loss**: Removing majority-class samples may discard important information.
- **Difficulty in handling class overlap near the decision boundary**: Undersampling may cause the model to perform poorly near the decision boundary.

Therefore, this paper proposes two novel undersampling methods:

1. **Mutual Information-based Stratified Simple Random Sampling (MI-SRS)**:
   - Use mutual information to stratify the data so that the data points in each stratum are representative.
   - Then perform simple random sampling within each stratum to preserve the distribution characteristics of the original data.
2. **Support Points Optimization (SPO)**:
   - Use support points to select a representative subset whose statistical characteristics are as close as possible to those of the original data.
   - Support points achieve this by minimizing the energy distance, which ensures that the selected subset faithfully reflects the distribution of the original data.

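The two ideas above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the summary does not say how mutual information defines the strata, so the sketch assumes one plausible reading (stratify the majority class by quantiles of the single feature with the highest estimated mutual information with the label, then sample proportionally within each stratum). The function names, the histogram-based mutual information estimate, and all parameters are hypothetical. The `energy_distance` helper uses the standard definition, 2E‖a−b‖ − E‖a−a′‖ − E‖b−b′‖, and serves here only as a diagnostic of how closely a subset matches the full majority class. It does not perform the support points optimization itself.

```python
import numpy as np


def binned_mutual_information(x, y, bins=10):
    """Plug-in estimate (in nats) of I(x; y) for a continuous feature x
    and discrete labels y, using equal-frequency bins. Illustrative only."""
    edges = np.quantile(x, np.linspace(0, 1, bins + 1)[1:-1])
    xb = np.digitize(x, edges)                  # bin index in 0..bins-1
    classes, yb = np.unique(y, return_inverse=True)
    joint = np.zeros((bins, classes.size))
    np.add.at(joint, (xb, yb), 1.0)             # joint histogram counts
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))


def mi_stratified_undersample(X, y, majority_label, n_keep, n_strata=5, seed=0):
    """Assumed reading of MI-SRS: stratify the majority class on the feature
    with the highest estimated MI, then draw a simple random sample per stratum."""
    rng = np.random.default_rng(seed)
    mi = [binned_mutual_information(X[:, j], y) for j in range(X.shape[1])]
    top = int(np.argmax(mi))

    maj_idx = np.flatnonzero(y == majority_label)
    vals = X[maj_idx, top]
    edges = np.quantile(vals, np.linspace(0, 1, n_strata + 1)[1:-1])
    strata = np.digitize(vals, edges)

    kept = []
    for s in range(n_strata):
        members = maj_idx[strata == s]
        # Proportional allocation; rounding may shift the total slightly from n_keep.
        quota = min(members.size, int(round(n_keep * members.size / maj_idx.size)))
        kept.append(rng.permutation(members)[:quota])

    minority_idx = np.flatnonzero(y != majority_label)  # minority rows are all kept
    return np.sort(np.concatenate(kept + [minority_idx])), top


def energy_distance(A, B):
    """V-statistic estimate of the energy distance between two samples (rows = points)."""
    def mean_dist(P, Q):
        diff = P[:, None, :] - Q[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1)).mean()
    return 2 * mean_dist(A, B) - mean_dist(A, A) - mean_dist(B, B)


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Synthetic imbalanced data: 900 majority rows, 100 minority rows.
    X = np.vstack([rng.normal(0, 1, (900, 3)), rng.normal(2, 1, (100, 3))])
    y = np.array([0] * 900 + [1] * 100)
    idx, top = mi_stratified_undersample(X, y, majority_label=0, n_keep=100)
    subset_majority = X[idx][y[idx] == 0]
    print("stratifying feature:", top)
    print("energy distance, subset vs. full majority:",
          energy_distance(subset_majority, X[y == 0]))
```

A smaller energy distance between the retained majority rows and the full majority class indicates a subset that better preserves the original distribution, which is the criterion the support points method targets directly.
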
The main goals of these two methods are to optimize the data selection process, minimize information loss, improve classification accuracy for minority classes, and maintain computational efficiency.

### Summary

The research problem in this paper can be summarized as: **Can undersampling methods based on stratified simple random sampling combined with mutual information, or on support points, provide more effective solutions than other techniques for addressing class imbalance and improving model performance on classification tasks?**

By introducing two advanced statistical concepts, mutual information and support points, this paper aims to address the deficiencies of existing undersampling methods and thereby provide better solutions for class imbalance in practical applications.