Abstract: A common approach for feature selection is to examine the variable importance scores for a machine learning model, as a way to understand which features are the most relevant for making predictions. Given the significance of feature selection, it is crucial for the calculated importance scores to reflect reality. Falsely overestimating the importance of irrelevant features can lead to false discoveries, while underestimating the importance of relevant features may lead us to discard important features, resulting in poor model performance. Additionally, black-box models like XGBoost provide state-of-the-art predictive performance, but cannot be easily understood by humans, and thus we rely on variable importance scores or methods for explainability like SHAP to offer insight into their behavior. In this paper, we investigate the performance of variable importance as a feature selection method across various black-box and interpretable machine learning methods. We compare the ability of CART, Optimal Trees, XGBoost and SHAP to correctly identify the relevant subset of variables across a number of experiments. The results show that regardless of whether we use the native variable importance method or SHAP, XGBoost fails to clearly distinguish between relevant and irrelevant features. On the other hand, the interpretable methods are able to correctly and efficiently identify irrelevant features, and thus offer significantly better performance for feature selection.

What problem does this paper attempt to address?

This paper evaluates how well different machine learning models perform feature selection, and in particular how effective variable importance is as a feature selection method. Specifically, the authors focus on how well relevant and irrelevant features can be distinguished, and on how CART, Optimal Trees, XGBoost, and XGBoost combined with SHAP differ in this regard. Through a series of experiments, especially in the case of high-dimensional data sets and biased data, the paper tests whether these models can accurately identify the truly important features without assigning excessive importance to irrelevant features or underestimating the importance of relevant ones.

### Background and Problem Description of the Paper

In the modern Internet era, a large amount of data is generated every day, and these data usually contain thousands of features. Examples include tracking users' purchasing behavior on web pages, or detailed sensor information collected while driving a car. This rich information provides an ideal environment for machine learning. However, the most powerful machine learning models are often black-box models, and it is very difficult for humans to understand how the input features are used to construct predictions. For high-dimensional data sets this limitation is particularly worrying, because it makes it hard to judge the relative quality of the various features collected. In fact, most or all of the predictive performance can often be achieved with a small number of features. Keeping the remaining features adds unnecessary computational complexity, lengthens model training, and can reduce out-of-sample performance, because the algorithm may fail to detect the features that drive the signal among many noisy ones.

### Research Objectives

The main purpose of the paper is to compare and evaluate the performance of different machine learning models (including black-box models and interpretable models) in feature selection, especially their ability to identify relevant features and exclude irrelevant features. The research focuses on:

- The performance of **XGBoost** and **SHAP** in feature selection, in both simple and complex scenarios.
- The advantages and limitations of interpretable models such as **CART** and **Optimal Trees** in feature selection.

### Experimental Design

The paper evaluates the performance of different models through the following experiments:

1. **Fixed Regression Tree Experiment**: Generate data following a fixed regression tree structure and evaluate the performance of each method under different noise levels.
2. **Random Classification Tree Experiment**: Generate a random classification tree structure and evaluate the performance of each method in feature selection.
3. **Random Classification Tree Experiment with Biased Data**: Introduce different numbers of unique values in the generated data and evaluate the performance of each method in the presence of biased data.

### Main Conclusions

- **XGBoost** (whether combined with SHAP or not) performs poorly in feature selection and wrongly assigns high importance to irrelevant features even in very simple scenarios.
- **Optimal Trees** is the strongest feature selection method: it efficiently and accurately identifies irrelevant features and is not affected by data bias.
- Although **CART** performs poorly in the presence of biased data, it is still better than **XGBoost**, especially with a small sample size.

### Significance

The results of this study are of great significance for feature selection and model interpretation, especially when using black-box models.
The paper emphasizes that caution should be exercised when using variable importance for feature selection and model interpretation, especially with **XGBoost** and **SHAP**, because they may misjudge the importance of features. In contrast, interpretable single-tree methods (such as **Optimal Trees**) are not only transparent but also perform well in eliminating irrelevant features, usually without sacrificing predictive performance.
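
The biased-data experiment above varies the number of unique values per feature. As a rough illustration of why this can matter for greedy, CART-style split search, here is a minimal pure-Python sketch. It is not the paper's experimental code: the function names, sample sizes, and the choice of 2 versus 50 unique values are assumptions made only for this example. It measures the best variance-reduction split on features that are pure noise, so any gain it reports is spurious.

```python
import random
import statistics


def best_split_gain(x, y):
    """Largest reduction in sum of squared errors over all thresholds on x."""
    n = len(y)
    order = sorted(range(n), key=lambda i: x[i])
    xs = [x[i] for i in order]
    ys = [y[i] for i in order]
    total_sum = sum(ys)
    total_sq = sum(v * v for v in ys)
    parent_sse = total_sq - total_sum ** 2 / n
    best = 0.0
    left_sum = 0.0
    left_sq = 0.0
    for i in range(n - 1):
        left_sum += ys[i]
        left_sq += ys[i] ** 2
        if xs[i] == xs[i + 1]:
            continue  # a threshold can only fall between distinct values
        n_left = i + 1
        n_right = n - n_left
        right_sum = total_sum - left_sum
        right_sq = total_sq - left_sq
        child_sse = (left_sq - left_sum ** 2 / n_left) + (
            right_sq - right_sum ** 2 / n_right
        )
        best = max(best, parent_sse - child_sse)
    return best


def mean_noise_gain(n_unique, n_samples=200, n_trials=200, seed=0):
    """Average best-split gain for a feature that is independent of y.

    The feature takes n_unique integer levels drawn uniformly at random
    (hypothetical setup, chosen only for illustration).
    """
    rng = random.Random(seed)
    gains = []
    for _ in range(n_trials):
        y = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        x = [rng.randrange(n_unique) for _ in range(n_samples)]
        gains.append(best_split_gain(x, y))
    return statistics.mean(gains)


gain_binary = mean_noise_gain(n_unique=2)   # one candidate threshold
gain_many = mean_noise_gain(n_unique=50)    # up to 49 candidate thresholds
print(f"mean spurious gain, 2 unique values:  {gain_binary:.2f}")
print(f"mean spurious gain, 50 unique values: {gain_many:.2f}")
```

Because the feature with more unique values offers more candidate thresholds, its best split tends to look better by chance alone, even though neither feature carries any signal. This is one plausible mechanism behind the bias the summary attributes to CART; the paper's own data-generating process and metrics may differ.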

Comparing interpretability and explainability for feature selection
