Semi-supervised feature selection by minimum neighborhood redundancy and maximum neighborhood relevancy

Damo Qian, Keyu Liu, Shiming Zhang, Xibei Yang
DOI: https://doi.org/10.1007/s10489-024-05578-9
IF: 5.3
2024-06-14
Applied Intelligence
Abstract: In the realm of machine learning, feature selection emerges as a prevalent data preprocessing technique, playing a crucial role in enhancing model performance across diverse downstream tasks such as fault diagnosis, biological recognition, and object detection. Nevertheless, the challenge of incomplete supervision, stemming from limited labeled data availability, poses a formidable obstacle in acquiring the optimal feature subset for model input. To address the problem that label scarcity may deteriorate feature evaluation and selection, we introduce a novel semi-supervised feature selection algorithm termed Semi2MNR, which integrates the principles of Minimum Neighborhood Redundancy and Maximum Neighborhood Relevancy. Firstly, k-nearest neighborhood granulation is leveraged to construct a collection of neighborhood uncertainty measures from the perspective of information theory. Then, neighborhood mutual information is used to assess feature-to-label relevance based on labeled samples and feature-to-feature redundancy based on unlabeled samples. Finally, with min-neighborhood-redundancy and max-neighborhood-relevancy as the evaluation criterion, a forward sequential search algorithm is devised to identify the min-redundant and max-relevant features. The empirical findings from our experiments on 12 UCI data sets demonstrate the superiority of Semi2MNR in the presence of partially labeled data with varying labeling rates. Comparative analysis against other feature selection algorithms suggests that CART, KNN, and SVM classifiers fed with features selected by Semi2MNR consistently yield optimal accuracies.
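The three steps named in the abstract (k-nearest neighborhood granulation, neighborhood mutual information, forward sequential search) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the per-feature Euclidean k-nearest neighborhoods, the particular form of neighborhood mutual information, the "relevance minus mean redundancy" score, and all function names (`knn_mask`, `neighborhood_mi`, `semi_forward_select`) are assumptions made here for illustration only.

```python
import numpy as np


def knn_mask(x, k):
    """Boolean n x n relation: row i marks the k nearest samples of x_i, itself included."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    np.fill_diagonal(d, -1.0)  # force every sample into its own neighborhood
    idx = np.argsort(d, axis=1, kind="stable")[:, :k]
    mask = np.zeros(d.shape, dtype=bool)
    mask[np.arange(d.shape[0])[:, None], idx] = True
    return mask


def neighborhood_mi(a, b):
    """Assumed form: -mean_i log(|A_i| * |B_i| / (n * |A_i & B_i|))."""
    n = a.shape[0]
    inter = (a & b).sum(axis=1)  # at least 1: both relations contain the sample itself
    return float(-np.mean(np.log(a.sum(axis=1) * b.sum(axis=1) / (n * inter))))


def semi_forward_select(X_lab, y_lab, X_unl, n_select, k=5):
    """Greedy forward search; the scoring rule below is an assumption, not the paper's."""
    X_lab = np.asarray(X_lab, dtype=float)
    X_unl = np.asarray(X_unl, dtype=float)
    y_lab = np.asarray(y_lab)
    m = X_lab.shape[1]

    # Relevance: computed on labeled samples against the same-class relation.
    same_class = y_lab[:, None] == y_lab[None, :]
    relevance = [neighborhood_mi(knn_mask(X_lab[:, j], k), same_class) for j in range(m)]

    # Redundancy: computed on unlabeled samples between pairs of features.
    unl = [knn_mask(X_unl[:, j], k) for j in range(m)]

    selected = []
    while len(selected) < min(n_select, m):
        best, best_score = None, -np.inf
        for j in range(m):
            if j in selected:
                continue
            if selected:
                red = np.mean([neighborhood_mi(unl[j], unl[s]) for s in selected])
            else:
                red = 0.0
            score = relevance[j] - red  # max relevance, min redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected
```

In this sketch, `X_lab` and `y_lab` hold the labeled samples, `X_unl` holds the unlabeled ones, and `k` must not exceed the number of samples in either set. The returned list gives feature indices in the order they were picked.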
computer science, artificial intelligence