Instance Hardness and Multivariate Gaussian Distribution-Based Oversampling Technique for Imbalance Classification

Xie Jie, Zhu Mingying, Hu Kai, Zhang Jinglan
DOI: https://doi.org/10.1007/s10044-022-01129-5
IF: 2.307
2023-01-01
Pattern Analysis and Applications
Abstract: Imbalance classification has received great attention due to its many real-world applications. Data-level approaches are the most convenient way to address data imbalance, and among them oversampling is the most extensively explored. However, most previous studies used distance-based factors to select minority class instances for oversampling, so the synthetic instances often did not follow the distribution of the original minority class instances. In this work, we propose a novel oversampling method based on instance hardness and the multivariate Gaussian distribution. First, a fused feature set comprising the k-disagree value and the classification error is used to select and weight minority class instances for oversampling. The k-disagree value is also used to filter majority class instances. Then, a multivariate Gaussian distribution is fitted to a subset of the selected minority class instances, where the subset is chosen using closest- and cluster-based methods. Next, new instances are generated from the fitted subset distribution. Finally, Euclidean distance-based instance selection is investigated to further improve imbalance classification performance. Experimental results on datasets from the KEEL repository show that the proposed method can outperform the compared oversamplers in terms of both AUC and G-mean.
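The two core ingredients named in the abstract, a neighborhood-based hardness score and sampling from a fitted multivariate Gaussian, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of the hardness score as a sampling weight, the ridge term, and the toy data are all assumptions made for illustration, and the classification-error feature, majority filtering, subset selection, and final instance selection steps are omitted.

```python
import numpy as np


def k_disagreeing_neighbors(X, y, k=5):
    """Fraction of each instance's k nearest neighbors with a different label.

    Hypothetical helper: one common way to compute a k-disagree style
    hardness score; the paper's exact definition may differ.
    """
    diff = X[:, None, :] - X[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diff, diff)  # squared Euclidean distances
    np.fill_diagonal(dist, np.inf)               # exclude the instance itself
    nn = np.argsort(dist, axis=1)[:, :k]         # indices of the k nearest neighbors
    return (y[nn] != y[:, None]).mean(axis=1)


def gaussian_oversample(X_min, weights, n_new, rng, ridge=1e-6):
    """Draw n_new synthetic points from a weighted Gaussian fit to X_min.

    Assumption: weights enter through a weighted mean and covariance.
    The small ridge keeps the covariance positive definite.
    """
    w = weights / weights.sum()
    mean = w @ X_min
    centered = X_min - mean
    cov = (centered * w[:, None]).T @ centered
    cov += ridge * np.eye(X_min.shape[1])
    return rng.multivariate_normal(mean, cov, size=n_new)


# Toy imbalanced data set (made up for this sketch): 90 majority, 10 minority.
rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(90, 2))
X_min = rng.normal(3.0, 0.5, size=(10, 2))
X = np.vstack([X_maj, X_min])
y = np.array([0] * 90 + [1] * 10)

kdn = k_disagreeing_neighbors(X, y, k=5)
weights = kdn[y == 1] + 1e-3  # small offset so the weights never sum to zero
X_new = gaussian_oversample(X_min, weights, n_new=80, rng=rng)
```

Here harder minority instances (those with more disagreeing neighbors) pull the fitted mean and covariance toward themselves, which is only one plausible reading of "selecting and weighting" in the abstract.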