FeatureLTE: Learning to Estimate Feature Importance

Tianping Zhang, Zhaoyang Wang, Chen Qian, Jian Li, Yin Lou
DOI: https://doi.org/10.1145/3654942
2024-01-01
Abstract: Estimating feature importance scores (FIS) is an important problem in many data-intensive applications. Traditional approaches fall into two types: model-specific methods and model-agnostic methods. In this work, we present FeatureLTE, a novel learning-based approach to FIS estimation. We demonstrate through extensive experiments that, for the first time, it is possible to build general-purpose pre-trained models for FIS estimation. FIS estimation therefore reduces to obtaining prediction outputs from a pre-trained FeatureLTE model. Pre-trained FeatureLTE models offer several desirable advantages, including accuracy, robustness, efficiency, and evolvability, and they are particularly effective on large datasets, where traditional methods often fail to scale. We build our pre-trained models for binary classification and regression problems using observations from nearly 1,000 public datasets. We systematically evaluate various design choices in FeatureLTE model construction and carefully design meta features so that they are computationally lightweight. In our evaluation, FeatureLTE is on par with the best existing FIS estimators in terms of FIS quality and achieves up to a 339.48x speedup on large-scale datasets without sacrificing the quality of FIS estimates. Finally, we release two pre-trained FeatureLTE models for binary classification and regression problems that are ready to use on almost all tabular datasets, along with a repository of 701 binary classification datasets and 256 regression datasets with pre-computed feature importance scores, to promote future research in this direction.
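The learning-to-estimate idea described in the abstract can be illustrated with a minimal sketch under strong assumptions. Everything named here is a hypothetical stand-in rather than the paper's design: the meta features (absolute Pearson correlation with the target and its square), the synthetic linear datasets used in place of public datasets with pre-computed scores, and the least-squares meta-model. The sketch only shows the overall shape: compute cheap per-feature meta features, train once across many datasets, then estimate scores on a new dataset with a single prediction pass.

```python
import numpy as np

def meta_features(X, y):
    # Hypothetical lightweight meta-features, one row per column of X:
    # |Pearson correlation with y|, its square, and a constant bias term.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (len(y) * X.std(axis=0) * y.std() + 1e-12)
    return np.column_stack([np.abs(corr), corr ** 2, np.ones(X.shape[1])])

def synthetic_dataset(rng, n=300, d=6):
    # Assumed stand-in for a public dataset with pre-computed importance scores.
    w = rng.normal(size=d) * (rng.random(d) < 0.5)
    w[rng.integers(d)] = 1.0  # guarantee at least one informative feature
    X = rng.normal(size=(n, d))
    y = X @ w + 0.1 * rng.normal(size=n)
    return X, y, np.abs(w) / np.abs(w).sum()

def pretrain(num_datasets=200, seed=0):
    # "Pre-training": fit a least-squares map from meta-features to scores.
    rng = np.random.default_rng(seed)
    feats, targets = [], []
    for _ in range(num_datasets):
        X, y, importance = synthetic_dataset(rng)
        feats.append(meta_features(X, y))
        targets.append(importance)
    weights, *_ = np.linalg.lstsq(
        np.vstack(feats), np.concatenate(targets), rcond=None
    )
    return weights

def estimate_importance(X, y, weights):
    # Estimation on a new dataset is one prediction pass; nothing is refit.
    return meta_features(X, y) @ weights

pretrained = pretrain()
```

For a new dataset, `estimate_importance(X, y, pretrained)` returns one score per column, and its cost is dominated by the meta-feature computation, which is the property the abstract credits for the speedup on large datasets.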