A cluster impurity-based hybrid resampling for imbalanced classification problems

Cheng, Ke-Yong
DOI: https://doi.org/10.1007/s10489-024-05644-2
IF: 5.3
2024-07-20
Applied Intelligence
Abstract: Classification, a supervised learning technique, plays a crucial role in categorizing and predicting observations across a wide range of machine learning applications, such as software defect detection, fraud detection in the financial sector, fault and defect detection in the manufacturing industry, and medical diagnosis. However, most classification algorithms have been developed under the assumption that the class distribution is balanced, although unequal class distributions are common in practice. When a class imbalance problem exists, the classifier generally becomes biased towards the majority class, and minority class instances are therefore often misclassified as the majority class. Class overlap is another known source of difficulty for the learning task and can degrade classification performance, especially when class imbalance is also present. In this research, we therefore propose a cluster impurity-based hybrid resampling method, including a partially balanced strategy, to improve classification performance on class-imbalanced data by taking intra-cluster class imbalance and inter-cluster overlap into account. Specifically, several clustering methods are employed to identify groups (i.e., clusters) of instances, and the cluster impurity of each instance is computed to measure the degree of cluster overlap. Then, based on the cluster impurity, instances are generated and eliminated recursively. To demonstrate the effectiveness of the proposed method, comprehensive experiments are conducted on forty imbalanced datasets, and two non-parametric hypothesis tests are employed to assess the statistical differences in classification performance between the proposed method and other traditional resampling methods.
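The idea of a per-instance cluster impurity can be illustrated with a minimal sketch. The abstract does not give the paper's exact formula, so the definition used here is an assumption: the impurity of an instance is the share of instances in its cluster that carry a different class label. The function name `cluster_impurity`, the toy data, and the `threshold` value are all hypothetical, and the cluster assignments are assumed to come from any clustering method run beforehand.

```python
from collections import Counter, defaultdict


def cluster_impurity(cluster_ids, class_labels):
    """Return, for each instance, the fraction of instances in its cluster
    that belong to a different class (an assumed definition, not the paper's)."""
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have equal length")

    # Count how often each class occurs inside each cluster.
    counts = defaultdict(Counter)
    for c, y in zip(cluster_ids, class_labels):
        counts[c][y] += 1
    sizes = {c: sum(cnt.values()) for c, cnt in counts.items()}

    return [1.0 - counts[c][y] / sizes[c] for c, y in zip(cluster_ids, class_labels)]


# Toy data: two clusters, one dominated by the majority class, one mixed.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1]
class_labels = ["maj", "maj", "maj", "min", "min", "min", "maj", "maj"]
impurity = cluster_impurity(cluster_ids, class_labels)

# Hypothetical rule: majority instances in strongly mixed clusters are
# candidates for elimination. The threshold is illustrative only.
threshold = 0.5
overlap_majority = [
    i
    for i, (y, imp) in enumerate(zip(class_labels, impurity))
    if y == "maj" and imp >= threshold
]
```

In this sketch a high impurity marks an instance that sits among instances of the other class, which is one plausible way to quantify overlap. How the proposed method actually uses impurity to decide which instances to generate and which to eliminate is not specified in the abstract.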
computer science, artificial intelligence