Feature Selection Based on Data Clustering

Hongzhi Liu, Zhonghai Wu, Xing Zhang
DOI: https://doi.org/10.1007/978-3-319-22180-9_23
2015-01-01
Abstract: Feature selection is an important step in data mining and machine learning. It can be used to reduce data measurement and storage requirements, and to defy the curse of dimensionality and thereby improve prediction performance. In this paper, we propose a feature selection method based on mutual information estimation. It avoids computing high-dimensional mutual information by transforming the high-dimensional feature space into one dimension through a novel supervised clustering method. Experimental results on ten benchmark data sets show that: (1) the performance of kNN, the naive Bayes classifier, and C4.5 using far fewer features selected by the proposed method is similar to or even better than that on the original data sets with the full feature set; (2) unlike most state-of-the-art methods, which require the number of features to be set in advance, the proposed method automatically determines the proper size of the selected feature subset.
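The idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the paper's supervised clustering method or its stopping rule, so everything below is an assumption made for illustration: nearest-class-centroid assignment stands in for the clustering step, a plug-in count estimate stands in for the mutual information estimator, and greedy forward selection that stops when the estimate no longer improves stands in for the automatic choice of subset size. The names `mutual_information`, `cluster_ids`, and `select_features` are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits for two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def cluster_ids(rows, labels, features):
    """Map each sample to one discrete value: the class whose centroid is nearest.

    ASSUMPTION: a stand-in for the paper's supervised clustering, not its method.
    Only the columns listed in `features` are used.
    """
    counts, sums = Counter(labels), {}
    for row, lab in zip(rows, labels):
        acc = sums.setdefault(lab, [0.0] * len(features))
        for k, f in enumerate(features):
            acc[k] += row[f]
    centroids = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

    def nearest(row):
        return min(centroids, key=lambda lab: sum(
            (row[f] - c) ** 2 for f, c in zip(features, centroids[lab])))

    return [nearest(row) for row in rows]

def select_features(rows, labels, tol=1e-9):
    """Greedy forward selection; stops once the estimated MI stops improving.

    ASSUMPTION: this stopping rule is illustrative, not taken from the paper.
    """
    remaining = list(range(len(rows[0])))
    selected, best = [], 0.0
    while remaining:
        # Score each candidate by the MI between the 1-D cluster id and the label.
        scored = [(mutual_information(
            cluster_ids(rows, labels, selected + [f]), labels), f)
            for f in remaining]
        score, f = max(scored, key=lambda t: (t[0], -t[1]))
        if score <= best + tol:
            break
        selected.append(f)
        remaining.remove(f)
        best = score
    return selected, best

# Toy data (made up): feature 0 separates the classes, feature 1 is noise.
rows = [[0.0, 5.0], [0.1, 1.0], [0.2, 3.0],
        [1.0, 4.0], [1.1, 2.0], [0.9, 0.0]]
labels = [0, 0, 0, 1, 1, 1]
selected, best = select_features(rows, labels)
print(selected, best)
```

Because each candidate subset is reduced to a single discrete cluster id before scoring, the mutual information is always estimated between two one-dimensional variables, whatever the size of the subset.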