Feature Selection as Deep Sequential Generative Learning

Wangyang Ying, Dongjie Wang, Haifeng Chen, Yanjie Fu
2024-03-07
Abstract: Feature selection aims to identify the most pattern-discriminative feature subset. In prior literature, filter (e.g., backward elimination) and embedded (e.g., Lasso) methods have hyperparameters (e.g., top-K, score thresholding) and are tied to specific models, and are thus hard to generalize; wrapper methods search for a feature subset in a huge discrete space and are computationally costly. To transform the way of feature selection, we regard a selected feature subset as a selection decision token sequence and reformulate feature selection as a deep sequential generative learning task that distills feature knowledge and generates decision sequences. Our method includes three steps: (1) We develop a deep variational transformer model trained with a joint objective of sequential reconstruction, variational, and performance evaluator losses. Our model can distill feature selection knowledge and learn a continuous embedding space to map feature selection decision sequences into embedding vectors associated with utility scores. (2) We leverage the trained feature subset utility evaluator as a gradient provider to guide the identification of the optimal feature subset embedding; (3) We decode the optimal feature subset embedding to autoregressively generate the best feature selection decision sequence with autostop. Extensive experimental results show this generative perspective is effective and generic, requiring neither a large discrete search space nor expert-specific hyperparameters.
Machine Learning
What problem does this paper attempt to address?
The paper addresses the limitations of existing feature selection methods:

1. **Filter Methods**: These methods rank features by a scoring criterion (for example, the correlation between each feature and the label) and select the top \( k \) features as the feature subset. However, they often overlook relationships between features, are sensitive to the data distribution, and involve no learning, so they perform poorly on complex datasets.
2. **Embedded Methods**: These methods select features by jointly optimizing feature selection and the downstream prediction task. For example, LASSO shrinks feature coefficients by optimizing a regression loss together with a regularization loss. However, they rely on strong structural assumptions (such as sparse coefficients) and on specific downstream models (such as regression), which limits their flexibility.
3. **Wrapper Methods**: These methods treat feature selection as a search problem over the large discrete space of feature combinations, and usually pair evolutionary or genetic algorithms with a downstream machine-learning model. However, the search space grows exponentially with the number of features (\( n \) features admit \( 2^n \) candidate subsets), so the computational cost is very high.

To overcome these problems, the authors propose a deep sequential generative learning framework that treats feature selection as a continuous optimization problem rather than a discrete selection process. Their contributions are:

- **Generative Perspective**: Regard feature selection as a deep sequential generative AI task, converting the selection of a feature subset into a continuous optimization problem.
- **EOG (Embedding-Optimization-Generation) Framework**: The framework has three steps:
  1. **Embedding**: Develop a variational transformer model that learns the feature subset embedding space by jointly optimizing the sequence reconstruction loss, the feature subset accuracy evaluation loss, and the variational distribution alignment loss (i.e., the Kullback-Leibler loss).
  2. **Optimization**: Use the trained feature subset utility evaluator to provide gradient information that guides the search for the optimal feature subset embedding.
  3. **Generation**: Decode the optimal embedding vector to autoregressively generate the best feature selection decision sequence.

With this approach, the authors aim to select the optimal feature subset without a large-scale discrete search, and to improve the effectiveness and generalization ability of feature selection.

### Formula Summary

The objective of feature selection can be formalized as:

\[ E^* = \arg\max_{E \in \mathcal{E}} M(X[\psi(E)], y), \qquad t^* = \psi(E^*) \]

where:

- \(\psi\) is the decoder, which generates a feature token sequence from any embedding \(E\);
- \(\mathcal{E}\) is the embedding space over which the search is carried out;
- \(E^*\) is the optimal feature subset embedding, and \(t^*\) is the feature token sequence decoded from it;
- \(M\) is the downstream machine-learning task, measured by its performance on the selected features;
- \(X[\cdot]\) denotes using a mapping table to convert the feature token sequence into a feature subset.

According to the authors, this method improves both the effectiveness of feature selection and its ability to generalize across data domains.
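The Optimization and Generation steps can be illustrated with a minimal, self-contained Python sketch. It is not the authors' implementation: the trained evaluator is replaced by a toy concave utility with a known peak, the transformer decoder by a hypothetical rule that reads each embedding coordinate as the score of one feature, and the names `STOP`, `threshold`, `step_size`, and `peak` are all assumptions made for illustration. What it does show is the control flow: follow the evaluator's gradient in the continuous embedding space, then decode the result one token at a time until a stop token is emitted.

```python
from typing import List, Sequence

STOP = -1  # hypothetical stop-token id that ends generation (auto-stop)


def utility(e: Sequence[float], peak: Sequence[float]) -> float:
    """Toy stand-in for the trained evaluator: utility is highest at `peak`."""
    return -sum((ei - pi) ** 2 for ei, pi in zip(e, peak))


def utility_grad(e: Sequence[float], peak: Sequence[float]) -> List[float]:
    """Analytic gradient of the toy utility with respect to the embedding."""
    return [-2.0 * (ei - pi) for ei, pi in zip(e, peak)]


def gradient_ascent(e0: Sequence[float], peak: Sequence[float],
                    step_size: float = 0.1, steps: int = 50) -> List[float]:
    """Optimization step: move the embedding along the evaluator's gradient."""
    e = list(e0)
    for _ in range(steps):
        g = utility_grad(e, peak)
        e = [ei + step_size * gi for ei, gi in zip(e, g)]
    return e


def decode(e: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Generation step (toy): emit one feature token per iteration, highest
    score first, and emit STOP once no remaining score reaches `threshold`."""
    tokens: List[int] = []
    remaining = list(range(len(e)))
    while True:
        best = max(remaining, key=lambda i: e[i], default=None)
        if best is None or e[best] < threshold:
            tokens.append(STOP)
            return tokens
        tokens.append(best)
        remaining.remove(best)


# Start from the embedding of a known subset and search for a better one.
e0 = [0.2, 0.9, 0.1, 0.4]
peak = [0.9, 0.8, 0.0, 0.7]  # assumed location of maximal predicted utility
e_star = gradient_ascent(e0, peak)
print(decode(e0))      # [1, -1]
print(decode(e_star))  # [0, 1, 3, -1]
```

In this toy setting the decoded sequence grows from one feature to three because the search moved the embedding toward the region the evaluator scores highest; no enumeration of discrete subsets is involved. In the paper's framework the evaluator and decoder are learned jointly with the encoder, so the gradient would come from the trained network rather than a closed-form expression.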