Shap-Select: Lightweight Feature Selection Using SHAP Values and Regression

Egor Kraev, Baran Koseoglu, Luca Traverso, Mohammed Topiwalla

2024-10-09

Abstract: Feature selection is an essential process in machine learning, especially when dealing with high-dimensional datasets. It helps reduce the complexity of machine learning models, improve performance, mitigate overfitting, and decrease computation time. This paper presents a novel feature selection framework, shap-select. The framework conducts a linear or logistic regression of the target on the Shapley values of the features, on the validation set, and uses the signs and significance levels of the regression coefficients to implement an efficient heuristic for feature selection in tabular regression and classification tasks. We evaluate shap-select on the Kaggle credit card fraud dataset, demonstrating its effectiveness compared to established methods such as Recursive Feature Elimination (RFE), HISEL (a mutual information-based feature selection method), Boruta and a simpler Shapley value-based method. Our findings show that shap-select combines interpretability, computational efficiency, and performance, offering a robust solution for feature selection.

Machine Learning

What problem does this paper attempt to address?

The paper addresses the problem of effective feature selection in high-dimensional datasets. The authors propose a new feature selection framework, shap-select, which combines SHAP (SHapley Additive exPlanations) values with statistical significance testing, with the aim of making feature selection more efficient and interpretable without sacrificing model performance. According to the paper, existing feature selection methods can be computationally expensive on high-dimensional data, and leaving irrelevant features in a model adds noise and encourages overfitting. These issues are particularly prominent in fields such as healthcare, finance, and bioinformatics. Shap-select addresses them by computing the SHAP values of the features on a validation set, regressing the target on those SHAP values (linear regression for regression tasks, logistic regression for classification tasks), and filtering features according to the sign and significance level of the resulting regression coefficients. To validate the approach, the authors ran experiments on the Kaggle credit card fraud detection dataset, comparing shap-select with Recursive Feature Elimination (RFE), HISEL (a mutual information-based feature selection method), Boruta, and a simpler Shapley value-based method. The reported results indicate that shap-select maintains high model performance at a lower computational cost while reducing the number of features, which also makes the resulting model easier to interpret.
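The regression step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation or the shap-select package API: a plain least-squares model stands in for the real model, its SHAP values use the closed form for linear models with independent features, p-values use a normal approximation, and the rule "keep a feature if its coefficient is positive and significant at `alpha = 0.05`" is one plausible reading of "signs and significance levels". All function names and the synthetic data are hypothetical.

```python
import math
import numpy as np


def fit_linear(X, y):
    """Fit y ~ intercept + X @ w by least squares; return (intercept, w)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[0], beta[1:]


def linear_shap_values(X, weights, background_mean):
    """SHAP values of a linear model, assuming independent features:
    phi[i, j] = w[j] * (X[i, j] - background_mean[j])."""
    return weights * (X - background_mean)


def ols_with_pvalues(S, y):
    """Regress y on the columns of S (plus an intercept).

    Returns the slope coefficients and two-sided p-values.
    Assumption: p-values use a normal approximation to the t-distribution,
    which is reasonable only for large samples.
    """
    n, k = S.shape
    A = np.column_stack([np.ones(n), S])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - k - 1)
    se = np.sqrt(np.diag(sigma2 * np.linalg.pinv(A.T @ A)))
    t_stats = beta / se
    pvals = np.array([math.erfc(abs(t) / math.sqrt(2.0)) for t in t_stats])
    return beta[1:], pvals[1:]


def select_features(coefs, pvals, alpha=0.05):
    """Assumed rule: keep features with a positive, significant coefficient."""
    return [j for j in range(len(coefs)) if coefs[j] > 0 and pvals[j] < alpha]


# Synthetic data: only features 0 and 1 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 5))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(size=3000)
X_train, X_val = X[:2000], X[2000:]
y_train, y_val = y[:2000], y[2000:]

# 1. Fit the model on the training split.
_, w = fit_linear(X_train, y_train)
# 2. Compute SHAP values on the validation split.
S_val = linear_shap_values(X_val, w, X_train.mean(axis=0))
# 3. Regress the validation target on the SHAP values.
coefs, pvals = ols_with_pvalues(S_val, y_val)
# 4. Keep features by sign and significance.
selected = select_features(coefs, pvals)
print("selected feature indices:", selected)
```

In this setup an informative feature gets a coefficient close to 1 on its SHAP column, because its SHAP values already carry the model's estimate of that feature's contribution to the target. A pure-noise feature can still pass the filter by chance at a rate governed by `alpha`. For a classification task, the linear regression in step 3 would be replaced by a logistic regression, as the abstract describes.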

Feature selection strategies: a comparative analysis of SHAP-value and importance-based methods

LLpowershap: Logistic Loss-based Automated Shapley Values Feature Selection Method

A feature selection method based on Shapley values robust for concept shift in regression

Machine Learning for Data Center Optimizations: Feature Selection Using Shapley Additive exPlanation (SHAP)

REFRESH: Responsible and Efficient Feature Reselection Guided by SHAP Values

Rethinking Data Shapley for Data Selection Tasks: Misleads and Merits

Hybridizing Target- and SHAP-encoded Features for Algorithm Selection in Mixed-variable Black-box Optimization

CHG Shapley: Efficient Data Valuation and Selection towards Trustworthy Machine Learning

CLE-SH: Comprehensive Literal Explanation package for SHapley values by statistical validity

SHAP@k: Efficient and Probably Approximately Correct (PAC) Identification of Top-k Features

Improving the Sampling Strategy in KernelSHAP

Automated Model Selection for Tabular Data

Feature selection integrating Shapley values and mutual information in reinforcement learning: An application in the prediction of post-operative outcomes in patients with end-stage renal disease

MIC-SHAP: An ensemble feature selection method for materials machine learning

A semi-parametric approach to feature selection in high-dimensional linear regression models

Assessment of feature selection for student academic performance through machine learning classification

Model interpretability of financial fraud detection by group SHAP

Comparative performance analysis of Boruta, SHAP, and Borutashap for disease diagnosis: A study with multiple machine learning algorithms

Data Selection for Fine-tuning Large Language Models Using Transferred Shapley Values