Relabeling & raking algorithm for imbalanced classification

Seunghwan Park, Hae-Hwan Lee, Jongho Im
DOI: https://doi.org/10.1016/j.eswa.2024.123274
IF: 8.5
2024-01-25
Expert Systems with Applications
Abstract: Imbalanced data classification, where the class distribution exhibits significant skewness, presents a challenging problem in binary classification tasks. This issue is particularly pronounced for high-dimensional data, as the presence of unequal class proportions can significantly degrade the performance of classifiers. Existing approaches to address this problem involve undersampling the majority class or oversampling the minority class to create balanced samples, thereby improving classification performance. However, extending these sampling methods to high-dimensional data and mixed data, which includes categorical variables, is nontrivial due to the need for approximating attribute distributions. In this paper, we propose a novel sampling strategy that incorporates raking and relabeling procedures to construct balanced samples by imputing attribute values from the majority class to the minority class. Our proposed algorithms demonstrate comparable performance to popular existing methods, while offering greater flexibility in accommodating diverse data shapes and attribute sizes. The practical appeal of our sampling algorithm lies in its ability to generate synthetic data for oversampling without the reliance on density estimation and its capability to handle mixed-type variables seamlessly. Furthermore, our sampling strategy exhibits robustness across different classifiers, as the choice of classifier does not significantly impact classification performance.
Computer Science, Artificial Intelligence; Engineering, Electrical & Electronic; Operations Research & Management Science