Semi-supervised Filter Feature Selection Based on Natural Laplacian Score and Maximal Information Coefficient

Quanwang Wu, Kun Cai, Jianxun Sun, Shanwei Wang, Jie Zeng
DOI: https://doi.org/10.1007/s13042-024-02246-9
2024-01-01
Abstract: As a crucial preprocessing step in data mining, feature selection aims to select a high-quality feature subset that improves classifier accuracy and reduces training time. This task is non-trivial, especially when some labels in a dataset are missing. Although several semi-supervised filter feature selection methods have been proposed, they generally fail to leverage labeled and unlabeled information effectively, and they adapt poorly to specific datasets. This paper proposes a novel semi-supervised filter feature selection method, called NM Score, to overcome these shortcomings. Specifically, to calculate the NM Score of a feature, its locality-preserving and label-discriminating power over the whole data space is measured via the natural Laplacian score (NLS), an improved, parameter-free Laplacian score based on natural neighbors. Meanwhile, its correlation with the limited available label information is measured via the maximal information coefficient (MIC), a general and equitable dependence measure. NLS and MIC are then combined adaptively, based on conflict ratios between neighborhoods and labels, to determine the NM Score of a feature and hence assess its importance. Experiments on UCI datasets and high-dimensional gene datasets show that NM Score is more effective than several state-of-the-art methods.
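For orientation, the following is a minimal sketch of the classical Laplacian score, the baseline that NLS is described as improving on. It is not the paper's NLS: it assumes a fixed neighbor count `k` and a heat-kernel width `t`, which are exactly the kind of parameters a natural-neighbor construction is said to remove. The function name `laplacian_score` and its arguments are illustrative choices, not taken from the paper. A lower score means the feature varies less across neighboring samples, i.e. it preserves locality better.

```python
import math

def laplacian_score(X, k=5, t=1.0):
    """Classical Laplacian score per feature (lower = better locality preservation).

    X is a list of samples, each a list of feature values.
    k (neighbor count) and t (heat-kernel width) are assumed parameters.
    """
    n, m = len(X), len(X[0])
    # Pairwise squared Euclidean distances between samples.
    d2 = [[sum((a - b) ** 2 for a, b in zip(X[i], X[j])) for j in range(n)]
          for i in range(n)]
    # Symmetric kNN graph with heat-kernel weights.
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        nbrs = sorted((j for j in range(n) if j != i), key=lambda j: d2[i][j])[:k]
        for j in nbrs:
            S[i][j] = S[j][i] = math.exp(-d2[i][j] / t)
    deg = [sum(row) for row in S]  # diagonal of the degree matrix D
    total_deg = sum(deg)
    scores = []
    for r in range(m):
        f = [x[r] for x in X]
        # Remove the degree-weighted mean of the feature.
        mean = sum(fi * di for fi, di in zip(f, deg)) / total_deg
        ft = [fi - mean for fi in f]
        # f^T L f = 1/2 * sum_ij S_ij (f_i - f_j)^2 for symmetric S.
        num = 0.5 * sum(S[i][j] * (ft[i] - ft[j]) ** 2
                        for i in range(n) for j in range(n))
        den = sum(di * fi * fi for fi, di in zip(ft, deg))  # f^T D f
        scores.append(num / den if den > 0 else math.inf)
    return scores
```

On data with two well-separated clusters, a feature that encodes cluster membership should receive a score near zero, while unstructured noise features score higher. This pure-Python version favors readability over speed and costs O(n²) per feature.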