Shap-Select: Lightweight Feature Selection Using SHAP Values and Regression

Egor Kraev, Baran Koseoglu, Luca Traverso, Mohammed Topiwalla
2024-10-09
Abstract: Feature selection is an essential process in machine learning, especially when dealing with high-dimensional datasets. It helps reduce the complexity of machine learning models, improve performance, mitigate overfitting, and decrease computation time. This paper presents a novel feature selection framework, shap-select. The framework conducts a linear or logistic regression of the target on the Shapley values of the features, on the validation set, and uses the signs and significance levels of the regression coefficients to implement an efficient heuristic for feature selection in tabular regression and classification tasks. We evaluate shap-select on the Kaggle credit card fraud dataset, demonstrating its effectiveness compared to established methods such as Recursive Feature Elimination (RFE), HISEL (a mutual information-based feature selection method), Boruta, and a simpler Shapley value-based method. Our findings show that shap-select combines interpretability, computational efficiency, and performance, offering a robust solution for feature selection.
Machine Learning
What problem does this paper attempt to address?
The paper addresses the problem of effective feature selection in high-dimensional datasets. The authors propose a new feature selection framework, shap-select, which combines SHAP (SHapley Additive exPlanations) values with statistical significance testing. The aim is to make feature selection more efficient and more interpretable without sacrificing model performance. According to the paper, existing feature selection methods suffer from high computational complexity, can introduce noise, and can overfit when applied to high-dimensional data. These issues are particularly prominent in fields such as healthcare, finance, and bioinformatics. Shap-select addresses them by regressing the target variable on the SHAP values of the features (linear regression for regression tasks, logistic regression for classification), using the validation set, and then filtering features by the sign and significance level of the resulting regression coefficients. To validate the approach, the authors ran experiments on the Kaggle credit card fraud detection dataset, comparing shap-select with Recursive Feature Elimination (RFE), HISEL (a mutual information-based feature selection method), Boruta, and a simpler Shapley value-based method. The results show that shap-select maintains high model performance at lower computational cost while reducing the number of features and improving the interpretability of the model.
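The core idea described above (regress the target on per-feature SHAP values computed on a validation set, then filter by coefficient sign and significance) can be illustrated with a minimal sketch. This is not the authors' implementation or the API of the shap-select package. It makes several simplifying assumptions: the underlying model is a plain linear regression, so SHAP values have the closed form w_j * (x_j - mean(x_j)) under feature independence; p-values use a normal approximation; and the rule "keep features with a positive, significant coefficient at alpha = 0.05" is one plausible reading of "signs and significance levels". The function names and the synthetic data are hypothetical.

```python
import math
import numpy as np


def linear_shap_values(weights, X, background_mean):
    """Closed-form SHAP values for a linear model, assuming independent
    features: phi_j = w_j * (x_j - E[x_j])."""
    return (X - background_mean) * weights


def shap_select_sketch(shap_val, y_val, alpha=0.05):
    """Regress the target on per-feature SHAP values (OLS with intercept)
    and keep features whose coefficient is positive and significant.
    The selection rule and alpha are assumptions made for illustration."""
    n, k = shap_val.shape
    design = np.column_stack([np.ones(n), shap_val])
    coef, *_ = np.linalg.lstsq(design, y_val, rcond=None)
    resid = y_val - design @ coef
    sigma2 = resid @ resid / (n - k - 1)          # residual variance
    cov = sigma2 * np.linalg.inv(design.T @ design)
    z = coef / np.sqrt(np.diag(cov))              # coefficient / std. error
    # Two-sided p-values under a normal approximation (an assumption;
    # a t-distribution would be the exact choice for OLS).
    p = np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])
    # Index 0 is the intercept, so feature j sits at position j + 1.
    return [j for j in range(k) if coef[j + 1] > 0 and p[j + 1] < alpha]


# Synthetic data: features 0 and 1 drive the target, features 2-4 are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=3000)
X_train, y_train = X[:2000], y[:2000]
X_val, y_val = X[2000:], y[2000:]

# Fit the stand-in model (linear regression) on the training split.
train_design = np.column_stack([np.ones(len(X_train)), X_train])
fit, *_ = np.linalg.lstsq(train_design, y_train, rcond=None)
intercept, weights = fit[0], fit[1:]

# SHAP values on the validation split, then the selection step.
shap_val = linear_shap_values(weights, X_val, X_train.mean(axis=0))
selected = shap_select_sketch(shap_val, y_val)
print("selected feature indices:", selected)
```

In this sketch the two informative features should be retained, while each noise feature is kept only if its coefficient happens to be both positive and significant by chance. For tree-based models, the SHAP values would instead come from an explainer such as the one in the `shap` library, and a logistic regression would replace OLS for classification targets.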