Adaptive Weighted Over-Sampling for Imbalanced Datasets Based on Density Peaks Clustering with Heuristic Filtering

Xinmin Tao, Qing Li, Wenjie Guo, Chao Ren, Qing He, Rui Liu, JunRong Zou
DOI: https://doi.org/10.1016/j.ins.2020.01.032
IF: 8.1
2020-01-01
Information Sciences
Abstract: Learning from imbalanced datasets poses a major challenge in the data mining community. Conventional classification algorithms generally perform poorly on imbalanced datasets because they were originally designed for balanced class distributions. Although different methods exist for addressing this issue, sampling methods, especially over-sampling techniques, have shown great potential because they improve the dataset itself rather than the classifier and can therefore be used with any classifier. In this paper, we propose a novel adaptive weighted over-sampling method for imbalanced datasets based on density peaks clustering with heuristic filtering. Unlike other clustering-based over-sampling methods, the proposed approach clusters the minority instances with a modified density peaks clustering rather than traditional k-means clustering, because it can accurately identify sub-clusters of different sizes and densities, which lets the method handle between-class and within-class imbalance arising from various causes at the same time. Subsequently, the over-sampling size for each identified sub-cluster is adaptively determined according to its own size and density, and the minority instances within each sub-cluster are then oversampled with probabilities inversely proportional to their distances to the majority class and to their densities, with the aim of generating more synthetic minority instances for borderline and sparser ones. Finally, to avoid generating overlapping instances, a heuristic filtering strategy is developed to iteratively move possibly overlapped minority instances away from the majority class.
Extensive experimental results on various imbalanced datasets demonstrate that the proposed approach achieves better classification performance on most datasets than other existing over-sampling techniques. (C) 2020 Elsevier Inc. All rights reserved.
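The weighted over-sampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the Gaussian-kernel density, the product form of the weight, the `radius` parameter, and the function names `sampling_weights` and `oversample` are all assumptions made for illustration. The sketch also treats the whole minority class as a single sub-cluster and omits both the density peaks clustering and the heuristic filtering.

```python
import math
import random


def sampling_weights(minority, majority, radius=1.0):
    """Assumed weighting: larger for minority points that are close to the
    majority class (borderline) and that lie in sparse regions."""
    weights = []
    for p in minority:
        # Distance to the nearest majority instance.
        d_maj = min(math.dist(p, q) for q in majority)
        # Gaussian-kernel density over minority points (self term included,
        # so the density is always at least 1).
        density = sum(math.exp(-(math.dist(p, q) / radius) ** 2) for q in minority)
        # Weight inversely proportional to both quantities (assumed product form).
        weights.append(1.0 / ((d_maj + 1e-12) * density))
    total = sum(weights)
    return [w / total for w in weights]


def oversample(minority, majority, n_new, radius=1.0, seed=0):
    """Draw seed points according to the weights, then interpolate toward a
    randomly chosen minority partner (SMOTE-style interpolation)."""
    rng = random.Random(seed)
    weights = sampling_weights(minority, majority, radius)
    synthetic = []
    for _ in range(n_new):
        base = rng.choices(minority, weights=weights, k=1)[0]
        partner = rng.choice(minority)
        gap = rng.random()
        synthetic.append(tuple(b + gap * (c - b) for b, c in zip(base, partner)))
    return synthetic


# Toy data: three dense minority points far from the majority class and one
# isolated minority point close to it.
minority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0)]
majority = [(4.0, 4.0), (5.0, 5.0)]
weights = sampling_weights(minority, majority)
new_points = oversample(minority, majority, n_new=5)
```

In this toy example the isolated point near the majority class receives the largest weight, which matches the stated aim of generating more synthetic instances for borderline and sparser minority instances.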