Feature Selection Based on Wasserstein Distance

Fuwei Li
2024-11-12
Abstract: In this paper, we present a novel feature selection method based on the Wasserstein distance. Feature selection plays a critical role in reducing the dimensionality of input data, thereby improving machine learning efficiency and generalization performance. Unlike traditional feature selection approaches that rely on criteria such as correlation or KL divergence, our method leverages the Wasserstein distance to measure the similarity between distributions of selected features and original features. This approach inherently accounts for similarities between classes, making it robust in scenarios involving noisy labels. Experimental results demonstrate that our method outperforms traditional approaches, particularly in challenging settings involving noisy labeled data.
Machine Learning
What problem does this paper attempt to address?
The main problem that this paper attempts to solve is how to handle high-dimensional data and noisy labels effectively in feature selection. Specifically, the author proposes a new feature selection method based on the Wasserstein distance. Unlike traditional feature selection methods, this method uses the Wasserstein distance to measure the similarity between the distribution of the selected features and the original feature distribution. It can better capture inter-class relationships and also remains robust when labels are noisy.

### Background and Motivation of the Paper

1. **Importance of Feature Selection**:
   - **Improving the Efficiency of Machine Learning**: Reducing the dimension of the input data accelerates training and can improve the generalization performance of the model.
   - **Reducing Storage Costs**: Although the price of storage devices has been decreasing year by year, data volume is growing faster. Reducing the data dimension therefore reduces storage costs.
   - **Avoiding Overfitting**: With a fixed number of samples, more powerful learning machines are more likely to overfit. Feature selection helps simplify the model and thus obtain better generalization performance.
2. **Limitations of Existing Methods**:
   - Traditional feature selection methods (such as those based on correlation or KL divergence) cannot handle inter-class relationships well and perform poorly when labels are noisy.

### Proposed Method

The paper proposes a feature selection method based on the Wasserstein distance. The Wasserstein distance (also known as Earth Mover's Distance) is a probability distance metric that can intrinsically use the distance information between classes.
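
To illustrate how distance information between classes enters the metric, here is a minimal sketch. It assumes, purely for illustration, that the classes are ordered on a line with ground cost |i - j|; the paper's actual cost between classes may differ. Under that assumption the 1-Wasserstein distance reduces to the sum of absolute differences between the two cumulative distributions. The helper names `wasserstein_1d` and `kl_divergence` are hypothetical.

```python
import math

def wasserstein_1d(p, q):
    """W1 between two distributions over classes 0..C-1.

    Assumption: ground cost between classes i and j is |i - j|, so W1 equals
    the sum of absolute differences of the cumulative distributions.
    """
    total, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi      # running difference of the two CDFs
        total += abs(cdf_gap)
    return total

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), smoothed with eps so that zero entries stay finite."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

p = [1.0, 0.0, 0.0, 0.0]        # all mass on class 0
q_near = [0.0, 1.0, 0.0, 0.0]   # all mass on the neighbouring class 1
q_far = [0.0, 0.0, 0.0, 1.0]    # all mass on the distant class 3

w_near = wasserstein_1d(p, q_near)
w_far = wasserstein_1d(p, q_far)
kl_near = kl_divergence(p, q_near)
kl_far = kl_divergence(p, q_far)

print(w_near, w_far)            # the distant class costs more to reach
print(kl_near == kl_far)        # KL treats both mistakes identically
```

In this toy case the Wasserstein distance grows with how far the mass has to move between classes, while the smoothed KL divergence assigns the same value to the near and the far mistake.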
Specifically, the author defines an optimization problem:

\[ \operatorname*{arg\,min}_{|\theta| = K} \mathbb{E}_X \, D_{\text{wass}}[p(Y|X), p(Y|X_\theta)] \]

where \(D_{\text{wass}}\) is the Wasserstein distance, \(X_\theta\) is the feature subset selected according to the index set \(\theta\), and \(K\) is the number of features to be selected.

### Advantages of the Method

1. **Capturing Inter-class Relationships**:
   - The Wasserstein distance takes into account the distance between classes, so it is more effective in handling data with a hierarchical structure or multi-class correlations.
2. **Robustness to Noisy Labels**:
   - With noisy labels, the Wasserstein distance can still maintain good performance because it intrinsically considers the similarity between classes rather than comparing class probabilities without regard to how close the classes are.

### Experimental Results

The experimental results show that this method outperforms traditional feature selection methods on data with noisy labels, especially in more challenging scenarios.

### Summary

By introducing the Wasserstein distance, this paper provides a new feature selection method that can effectively improve the performance and generalization ability of machine learning models on high-dimensional data with noisy labels.
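
The optimization problem stated under Proposed Method can be made concrete with a small sketch. This is not the author's algorithm, which the summary does not describe. It assumes discrete features, empirical estimates of \(p(Y|X)\) and \(p(Y|X_\theta)\) obtained by counting, ordered classes with ground cost |i - j|, and a brute-force search over all index sets of size \(K\), which is only feasible for tiny problems. All function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

def wasserstein_1d(p, q):
    """W1 over ordered classes, assuming ground cost |i - j|."""
    total, gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        gap += pi - qi
        total += abs(gap)
    return total

def conditional(X, y, idx, n_classes):
    """Empirical p(Y | X_idx), keyed by the values of the chosen features."""
    counts = defaultdict(Counter)
    for row, label in zip(X, y):
        counts[tuple(row[i] for i in idx)][label] += 1
    return {
        key: [c[k] / sum(c.values()) for k in range(n_classes)]
        for key, c in counts.items()
    }

def objective(X, y, theta, n_classes):
    """Sample average of D_wass[p(Y|X), p(Y|X_theta)]."""
    all_idx = tuple(range(len(X[0])))
    full = conditional(X, y, all_idx, n_classes)
    sub = conditional(X, y, theta, n_classes)
    total = 0.0
    for row in X:
        p_full = full[tuple(row)]
        p_sub = sub[tuple(row[i] for i in theta)]
        total += wasserstein_1d(p_full, p_sub)
    return total / len(X)

def select_features(X, y, K, n_classes):
    """Exhaustive arg min over index sets theta with |theta| = K."""
    candidates = combinations(range(len(X[0])), K)
    return min(candidates, key=lambda t: objective(X, y, t, n_classes))

# Toy data: feature 0 determines the label; features 1 and 2 carry no
# information about it on their own.
X = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0), (2, 0, 0), (2, 1, 1)]
y = [0, 0, 1, 1, 2, 2]

best = select_features(X, y, K=1, n_classes=3)
print(best)
```

On this toy data, keeping only feature 0 reproduces the full conditional distribution exactly, so its objective value is zero and it is the selected index set. Any practical method would need a scalable search and a smoother estimate of the conditional distributions than raw counts.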