Mixed Measure-Based Feature Selection Using the Fisher Score and Neighborhood Rough Sets
Sun Lin, Zhang Jiuxiao, Ding Weiping, Xu Jiucheng
DOI: https://doi.org/10.1007/s10489-021-03142-3
IF: 5.3
2022-01-01
Applied Intelligence
Abstract: Existing feature selection methods often neglect the distribution of the data and typically require the neighborhood radius in neighborhood rough sets (NRS) to be selected manually. These limitations lead to the misclassification of samples. To address these drawbacks, this paper presents a mixed measure-based feature selection method that uses the Fisher score and an NRS model. First, the variation coefficient of the features in different decision classes is defined to describe the dispersion degree of the features; based on it, the neighborhood class is described and a novel NRS model is developed. The dependency degree, neighborhood knowledge granularity, and average neighborhood entropy are then defined, and a mixed measure combining the information and algebra views is proposed to measure the uncertainty in neighborhood decision systems. Second, the average correlation degree of the feature subset is computed to assess the redundancy of the reduced feature subset. By combining the classification accuracy of the selected features, the reduction rate of the classification result, and the average correlation degree of the reduced feature set, an adaptive neighborhood radius function is constructed so that the optimal neighborhood radius does not have to be selected manually. An optimal feature subset can then be obtained according to the internal and external significance of the features. Third, the variation coefficient of the samples in different decision classes for each feature is defined to compute the dispersion degree of the samples, and the average of all samples in each feature is added to the between-class scatter to eliminate the effect of the features' different measurement dimensions; the Fisher score model is thereby improved to eliminate the noise in high-dimensional data. Finally, a heuristic feature selection algorithm with the Fisher score, based on the new NRS model, is designed to select an optimal feature subset.
Experimental results on five low-dimensional UCI datasets and nine high-dimensional gene expression datasets show that the developed algorithm is effective and can select an optimal reduced feature subset with high classification accuracy compared with some of the latest algorithms.
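For orientation, the two baseline quantities that the abstract builds on can be sketched in a few lines of Python. This is a minimal illustration of the textbook Fisher score and the classical NRS dependency degree, not the paper's improved variants: it omits the variation coefficient, the mixed measure, and the adaptive radius function. The function names, the use of Euclidean distance, and the fixed `radius` argument are assumptions made for the example.

```python
import numpy as np


def fisher_score(X, y):
    """Textbook Fisher score per feature (assumed baseline, not the paper's
    improved model): between-class scatter divided by within-class scatter."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for label in np.unique(y):
        Xc = X[y == label]
        n_c = Xc.shape[0]
        between += n_c * (Xc.mean(axis=0) - overall_mean) ** 2
        within += n_c * Xc.var(axis=0)  # population variance within the class
    # Guard against division by zero for features that are constant per class.
    return between / np.maximum(within, 1e-12)


def neighborhood_dependency(X, y, features, radius):
    """Classical NRS dependency degree with a fixed, manually chosen radius:
    the fraction of samples whose neighborhood (Euclidean distance <= radius
    on the chosen features) contains only samples of the same class."""
    Xs = np.asarray(X, dtype=float)[:, features]
    y = np.asarray(y)
    diff = Xs[:, None, :] - Xs[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    neighbors = dist <= radius  # boolean matrix, includes each sample itself
    consistent = [bool(np.all(y[neighbors[i]] == y[i])) for i in range(len(y))]
    return float(np.mean(consistent))
```

A feature that separates the classes well receives a high Fisher score and a dependency degree close to 1, while an uninformative feature scores near 0 on both. The fixed `radius` here is exactly the manual choice that the adaptive neighborhood radius function described in the abstract is meant to avoid.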