TSFNFS: two-stage-fuzzy-neighborhood feature selection with binary whale optimization algorithm
Lin Sun,Xinya Wang,Weiping Ding,Jiucheng Xu,Huili Meng
DOI: https://doi.org/10.1007/s13042-022-01653-0
2023-01-26
International Journal of Machine Learning and Cybernetics
Abstract:The optimal global feature subset cannot be found easily due to the high cost, and most swarm intelligence optimization-based feature selection methods are inefficient in handling high-dimensional data. In this study, a two-stage feature selection model based on fuzzy neighborhood rough sets (FNRS) and binary whale optimization algorithm (BWOA) is developed. First, to denote the fuzziness of samples for mixed data with symbolic and numerical features, fuzzy neighborhood similarity is presented to study the similarity matrix and fuzzy membership degree, and the lower and upper approximations can be developed to present new FNRS model. Fuzzy neighborhood-based uncertainty measures such as dependence degree, knowledge granularity, and entropy measures are studied. From the viewpoints of algebra and information, fuzzy knowledge granularity conditional entropy is presented to form a preselected feature reduction set in the first stage. Second, the cosine curve change is added to develop a new control factor, which slows down the convergence rate of BWOA in the early iteration to fully explore the global, and accelerates the convergence rate in the late iteration. Integrating dependence degree with fuzzy knowledge granularity conditional entropy, a new fitness function is designed for selecting an optimal feature subset in this second stage. Two strategies are fused to avoid BWOA falling into the local optimum: the population partition strategy with the adaptive neighborhood search radius to divide the whale population and the local interference strategy of the elite subgroup to adjust the whale position update. Finally, a two-stage feature selection algorithm is designed, where the Fisher score algorithm is employed to preliminarily delete those redundancy features of high-dimensional datasets. Experiments on six UCI datasets and five gene expression datasets show that our algorithm is valid compared to other related algorithms.