Learning Confidence Bounds for Classification with Imbalanced Data

Matt Clifford, Jonathan Erskine, Alexander Hepburn, Raúl Santos-Rodríguez, Dario Garcia-Garcia
2024-10-01
Abstract: Class imbalance poses a significant challenge in classification tasks, where traditional approaches often lead to biased models and unreliable predictions. Undersampling and oversampling techniques have been commonly employed to address this issue, yet they suffer from inherent limitations stemming from their simplistic approach, such as loss of information and additional biases, respectively. In this paper, we propose a novel framework that leverages learning theory and concentration inequalities to overcome the shortcomings of traditional solutions. We focus on understanding the uncertainty in a class-dependent manner, as captured by confidence bounds that we directly embed into the learning process. By incorporating class-dependent estimates, our method can effectively adapt to the varying degrees of imbalance across different classes, resulting in more robust and reliable classification outcomes. We empirically show how our framework provides a promising direction for handling imbalanced data in classification tasks, offering practitioners a valuable tool for building more accurate and trustworthy models.
Machine Learning
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

The paper addresses class imbalance in classification tasks. In real-world datasets, classes often contain very different numbers of samples. For example, in a medical diagnosis dataset, healthy individuals typically far outnumber individuals with a rare disease. This imbalance can bias traditional classification algorithms towards the majority class, resulting in poor predictive performance on the minority class. Existing methods such as undersampling, oversampling, and cost-sensitive learning each have limitations; for example, undersampling discards information and oversampling can introduce additional biases.

The paper proposes a framework that leverages learning theory and concentration inequalities to overcome these shortcomings. It embeds class-dependent confidence bounds directly into the learning process, so that the uncertainty associated with each class, and with the minority class in particular, is accounted for explicitly. This lets the method adapt to different degrees of class imbalance, improving the robustness and reliability of classification results.

Specifically, the method adjusts the bias term of a pre-trained classifier to reflect the uncertainty caused by the smaller number of minority-class samples. The approach is presented as more theoretically rigorous than traditional methods and as performing well in practice, especially when the pre-trained classifier has already learned a good representation of the data.
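To make the idea of a class-dependent bias adjustment concrete, here is a minimal sketch. It is not the paper's algorithm. It assumes a binary classifier that outputs a real-valued score on a scale comparable to a probability difference, and it uses a standard two-sided Hoeffding radius as the confidence bound for each class. The shift rule in `adjusted_bias` and all function names are hypothetical, chosen only for illustration.

```python
import numpy as np


def hoeffding_radius(n, delta=0.05):
    """Two-sided Hoeffding radius for the mean of n i.i.d. values in [0, 1].

    With probability at least 1 - delta, the empirical mean lies within
    this radius of the true mean. Fewer samples give a wider radius.
    """
    return np.sqrt(np.log(2.0 / delta) / (2.0 * n))


def adjusted_bias(bias, n_neg, n_pos, delta=0.05):
    """Hypothetical rule (an assumption, not the paper's formula).

    Shift the bias towards the class whose estimate is less certain,
    by the difference between the two class-wise radii.
    """
    eps_neg = hoeffding_radius(n_neg, delta)
    eps_pos = hoeffding_radius(n_pos, delta)
    return bias + (eps_pos - eps_neg)


def predict(scores, bias):
    """Predict class 1 when score + bias is positive, else class 0."""
    return (np.asarray(scores) + bias > 0).astype(int)


# Illustrative counts: 1000 majority (class 0) and 20 minority (class 1) samples.
scores = np.array([-0.5, -0.1, 0.2])  # made-up scores from a pre-trained model
new_bias = adjusted_bias(0.0, n_neg=1000, n_pos=20)

before = predict(scores, 0.0)
after = predict(scores, new_bias)
```

In this sketch the minority class has the wider radius, so the bias moves in its favour and a borderline score can flip from class 0 to class 1. With equal class counts the two radii cancel and the bias is unchanged.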