Selecting Features by their Resilience to the Curse of Dimensionality

Maximilian Stubbemann, Tobias Hille, Tom Hanika
2023-04-17
Abstract: Real-world datasets are often of high dimension and affected by the curse of dimensionality. This hinders their comprehensibility and interpretability. To reduce the complexity, feature selection aims to identify features that are crucial to learn from said data. While measures of relevance and pairwise similarities are commonly used, the curse of dimensionality is rarely incorporated into the process of selecting features. Here we step in with a novel method that identifies the features that allow to discriminate data subsets of different sizes. By adapting recent work on computing intrinsic dimensionalities, our method is able to select the features that can discriminate data and thus weaken the curse of dimensionality. Our experiments show that our method is competitive and commonly outperforms established feature selection methods. Furthermore, we propose an approximation that allows our method to scale to datasets consisting of millions of data points. Our findings suggest that features that discriminate data and are connected to a low intrinsic dimensionality are meaningful for learning procedures.
Machine Learning
What problem does this paper attempt to address?
This paper addresses how the "curse of dimensionality" affects feature selection in high-dimensional datasets. The authors point out that existing feature selection methods typically rely on measures of relevance and pairwise similarity and rarely take the curse of dimensionality into account. As a result, even when relevant features are selected, they may still fail to distinguish data points from one another in high-dimensional data. The paper therefore proposes a new unsupervised feature selection method. It evaluates how well each feature discriminates data subsets of different sizes and selects the features that weaken the curse of dimensionality. An approximation lets the method scale to large datasets.

### Main contributions of the paper:

1. **Proposing a new feature selection method**: The method selects features by evaluating their discriminative ability on data subsets, thereby mitigating the impact of the curse of dimensionality.
2. **Introducing the concept of intrinsic dimension**: The intrinsic dimension is used to quantify the discriminative ability of features and is applied to feature selection.
3. **Proposing an approximation algorithm**: The approximation makes feature selection feasible on large-scale datasets with millions of data points.
4. **Experimental verification**: Experiments on multiple real-world datasets show that the proposed method is competitive in classification tasks and commonly outperforms established feature selection methods.

### Method overview:

- **Definition of feature discriminative ability**: The paper defines the discriminability of a feature, which measures how well the feature distinguishes data points within data subsets of different sizes.
- **Normalized discriminative ability and intrinsic dimension**: Normalizing the discriminability yields a normalized intrinsic dimension, which is used to rank and select features.
- **Approximation algorithm**: To handle large-scale datasets, the paper proposes an approximation based on support sequences, which efficiently computes the discriminability and intrinsic dimension of features.

### Experimental results:

- **Logistic regression experiment**: On the OpenML-CC18 benchmark datasets, the proposed method outperforms random selection and other baseline methods in most cases.
- **Graph neural network experiment**: On the Open Graph Benchmark datasets, the proposed method performs well at feature selection, especially on large-scale datasets.

In conclusion, by introducing the discriminability and intrinsic dimension of features, this paper proposes a new unsupervised feature selection method that mitigates the curse of dimensionality in high-dimensional datasets and shows its competitiveness in multiple experiments.
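
The discriminability idea from the method overview can be illustrated with a small sketch. The summary does not give the paper's formulas, so everything here is an assumption made for illustration: a feature's discriminability on subsets of size k is taken to be the smallest range the feature spans over any k data points, the scores are averaged over all k, features are min-max scaled to make them comparable, and the support-sequence approximation is modeled by evaluating only selected subset sizes and bounding the rest. The function names are hypothetical and this is not the authors' implementation.

```python
import numpy as np


def _scaled_sorted(values):
    """Sort a feature column and min-max scale it to [0, 1].

    Assumption: min-max scaling stands in for the paper's normalization.
    """
    v = np.sort(np.asarray(values, dtype=float))
    span = v[-1] - v[0]
    return (v - v[0]) / span if span > 0 else np.zeros_like(v)


def _min_window_range(sorted_vals, k):
    """Smallest range spanned by any k of the (sorted) values."""
    n = len(sorted_vals)
    return float(np.min(sorted_vals[k - 1:] - sorted_vals[: n - k + 1]))


def discriminability(values):
    """Assumed score: mean over k = 2..n of the smallest k-point range."""
    v = _scaled_sorted(values)
    n = len(v)
    return sum(_min_window_range(v, k) for k in range(2, n + 1)) / n


def discriminability_bounds(values, support):
    """Lower and upper bounds computed only at the subset sizes in `support`.

    The smallest k-point range never decreases as k grows, so sizes that
    fall between two support elements are bounded by the values at those
    elements. This mirrors the support-sequence idea only loosely.
    """
    v = _scaled_sorted(values)
    n = len(v)
    s = sorted(set(support) | {2, n})
    exact = {k: _min_window_range(v, k) for k in s}
    lower = upper = exact[n]
    for a, b in zip(s[:-1], s[1:]):
        lower += (b - a) * exact[a]
        upper += exact[a] + (b - a - 1) * exact[b]
    return lower / n, upper / n


def rank_features(X):
    """Return column indices ordered from most to least discriminating."""
    X = np.asarray(X, dtype=float)
    scores = [discriminability(X[:, j]) for j in range(X.shape[1])]
    return sorted(range(X.shape[1]), key=lambda j: -scores[j])


# Toy data: an evenly spread feature, a feature with a single outlier,
# and a constant feature.
X_demo = np.array([
    [0.0, 0.0, 5.0],
    [1.0, 0.0, 5.0],
    [2.0, 0.0, 5.0],
    [3.0, 1.0, 5.0],
])
ranking = rank_features(X_demo)
```

Under these assumptions, the evenly spread column is ranked first, because even its tightest subsets still span a noticeable part of the feature's range. The single-outlier column scores lower, and the constant column scores zero because it separates nothing. Passing a sparse `support` to `discriminability_bounds` trades exactness for fewer evaluations while still bracketing the exact score.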