Spectral clustering based oversampling: oversampling taking within-class imbalance into consideration

Zichao LUO, Sun JIN, Xuefeng QIU
DOI: https://doi.org/10.3778/j.issn.1002-8331.1312-0148
2014-01-01
Abstract: Imbalanced datasets are one of the most crucial challenges encountered by data mining techniques. Oversampling has been shown to be a very effective way of dealing with imbalanced datasets. However, traditional oversampling methods pay no attention to within-class imbalance, which is pervasive in real-world datasets. To resolve this problem, this paper proposes an oversampling method based on modified spectral clustering. The method first automatically determines the best number of clusters, then applies modified spectral clustering to the minority samples. Based on the number of samples contained in each cluster, it determines how many samples should be generated inside each cluster, so that the resulting dataset is balanced both between and within classes. The method is tested on four real-world datasets and one simulated dataset, and the experiments show it to be effective. Moreover, the proposed method is compared with traditional k-means clustering based oversampling, and the results are analyzed and explained.
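The per-cluster generation step described in the abstract can be illustrated with a rough sketch. The abstract does not give the allocation formula, so the code below assumes one plausible rule: every minority cluster is grown toward the same target size, and the targets sum to the majority-class count. The function names `per_cluster_quota` and `oversample_cluster` are hypothetical, the cluster assignments are assumed to come from a clustering step that is not shown, and the interpolation between random pairs is a generic SMOTE-like stand-in rather than the paper's own generator.

```python
import random


def per_cluster_quota(cluster_sizes, majority_count):
    """Number of synthetic samples to create in each minority cluster.

    Assumed rule (not taken from the paper): split majority_count evenly
    across clusters and top each cluster up to its share.
    """
    k = len(cluster_sizes)
    base, extra = divmod(majority_count, k)
    # The first `extra` clusters absorb the remainder of the division.
    targets = [base + (1 if i < extra else 0) for i in range(k)]
    # A cluster already at or above its target receives no new samples.
    return [max(t - n, 0) for t, n in zip(targets, cluster_sizes)]


def oversample_cluster(points, n_new, rng):
    """Create n_new points by interpolating between random pairs in one cluster.

    `points` is a non-empty list of equal-length numeric lists.
    """
    if len(points) < 2:
        # A single-point cluster can only be duplicated.
        return [list(points[0]) for _ in range(n_new)]
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(points, 2)
        t = rng.random()
        # New point lies on the segment between a and b.
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic


if __name__ == "__main__":
    rng = random.Random(0)
    sizes = [30, 10, 5]          # minority cluster sizes (illustrative)
    quota = per_cluster_quota(sizes, majority_count=90)
    print(quota)                 # [0, 20, 25]
    new_points = oversample_cluster([[0.0, 0.0], [1.0, 1.0]], 3, rng)
    print(len(new_points))       # 3
```

With this rule, small clusters receive proportionally more synthetic samples than large ones, which is the within-class balancing effect the abstract describes. A cluster that already exceeds its share is left untouched, so the minority total can end up slightly above the majority count.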