Neuro-Symbolic Embedding for Short and Effective Feature Selection via Autoregressive Generation

Nanxu Gong, Wangyang Ying, Dongjie Wang, Yanjie Fu
2024-04-26
Abstract: Feature selection aims to identify the optimal feature subset for enhancing downstream models. Effective feature selection can remove redundant features, save computational resources, accelerate the model learning process, and improve the model's overall performance. However, existing works are often time-intensive to identify the effective feature subset within high-dimensional feature spaces. Meanwhile, these methods mainly utilize a single downstream task performance as the selection criterion, leading to selected subsets that are not only redundant but also lack generalizability. To bridge these gaps, we reformulate feature selection through a neuro-symbolic lens and introduce a novel generative framework aimed at identifying short and effective feature subsets. More specifically, we found that feature ID tokens of the selected subset can be formulated as symbols to reflect the intricate correlations among features. Thus, in this framework, we first create a data collector to automatically collect numerous feature selection samples consisting of feature ID tokens, model performance, and the measurement of feature subset redundancy. Building on the collected data, an encoder-decoder-evaluator learning paradigm is developed to preserve the intelligence of feature selection into a continuous embedding space for efficient search. Within the learned embedding space, we leverage a multi-gradient search algorithm to find more robust and generalized embeddings with the objective of improving model performance and reducing feature subset redundancy. These embeddings are then utilized to reconstruct the feature ID tokens for executing the final feature selection. Ultimately, comprehensive experiments and case studies are conducted to validate the effectiveness of the proposed framework.
Machine Learning
What problem does this paper attempt to address?
This paper aims to solve two main problems in feature selection:

1. **Efficient identification of effective features**: As the dimensionality of the feature space increases, it becomes harder to capture the complex correlations between features, and the time cost of feature selection rises significantly. Existing methods are inefficient in high-dimensional feature spaces and struggle to identify the optimal feature subset.
2. **Simultaneously consider redundancy and effectiveness as feature selection criteria**: Most traditional methods optimize the performance of a single downstream task, so the selected feature subset may be too specific to one model and generalize poorly. A multi-objective search method that balances redundancy and effectiveness is therefore needed to avoid over-fitting and improve the overall performance of the model.

To overcome these limitations, the paper proposes a new feature selection framework, Neuro-Symbolic Embedding (NS), which efficiently identifies short and effective feature subsets through autoregressive generation. Specifically, the framework treats feature selection as a sequence generation task, uses reinforcement learning to collect training data, and embeds the knowledge of feature selection into a continuous embedding space through the encoder-decoder-evaluator learning paradigm, enabling efficient search and optimization.

### Main contributions

1. **Neuro-symbolic perspective**: Redefine the feature selection task as a sequence generation task, and capture the complex correlations between features by representing feature IDs as symbols.
2. **Data collection mechanism**: Design a data collection mechanism based on reinforcement learning to automatically collect samples consisting of feature subsets, model performance, and feature redundancy.
3. **Encoder-decoder-evaluator framework**: Construct an encoder-decoder-evaluator learning framework that converts the discrete feature selection process into an optimization problem in a continuous space, improving search efficiency and effectiveness.
4. **Multi-gradient search algorithm**: In the learned embedding space, use the multi-gradient search algorithm to find more robust and generalized embeddings that improve model performance and reduce the redundancy of feature subsets.
5. **Experimental verification**: Verify the effectiveness of the proposed framework through extensive experiments and case studies on 16 real-world datasets.

### Method overview

1. **Data collection**:
   - **Performance collection**: Collect performance data for feature subsets through supervised and unsupervised methods. The supervised method uses a random forest as the performance evaluator, and the unsupervised method uses the average Laplacian Score.
   - **Redundancy collection**: Calculate the redundancy of each feature subset, using measures such as mutual information, covariance, and the Pearson correlation coefficient to quantify the redundancy between features.
2. **Data augmentation**:
   - Generate more training samples by shuffling the order of features within each subset, which increases data diversity and reduces over-fitting.
3. **Feature subset embedding model**:
   - **Encoder**: Use a Variational Transformer to embed the feature subset sequence into a continuous embedding vector.
   - **Decoder**: Reconstruct the feature subset sequence with a Transformer-based architecture.
   - **Evaluator**: Estimate the prediction performance and redundancy of the feature subset, and provide gradient information for quickly searching for the optimal feature subset.
4. **Joint optimization**:
   - Minimize the reconstruction loss, evaluator loss, and KL divergence loss together as a joint loss function that balances the influence of the individual loss terms.
5. **Optimal embedding search and reconstruction**:
   - Use gradient ascent to search for the optimal embedding in the embedding space, then decode it to generate the final feature subset.

### Experimental results

Through experiments on 16 datasets, the paper verifies the superiority of the proposed method in improving downstream task performance, reducing feature redundancy, and improving model generalization. The results show that the method performs well across a variety of tasks and datasets and has high practical value.
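
The redundancy-collection and data-augmentation steps in the method overview can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `subset_redundancy` and `augment_by_shuffling` are hypothetical, and the choice of the mean absolute pairwise Pearson correlation as the redundancy score is an assumption made for illustration (the summary also lists mutual information and covariance as options).

```python
import itertools
import random

import numpy as np


def subset_redundancy(X, subset):
    """Mean absolute pairwise Pearson correlation among the selected columns.

    Assumed redundancy measure for illustration only. Returns 0.0 for
    subsets with fewer than two features. Constant columns would yield NaN.
    """
    if len(subset) < 2:
        return 0.0
    # rowvar=False treats each column of X as one variable (feature).
    corr = np.corrcoef(X[:, subset], rowvar=False)
    pairs = itertools.combinations(range(len(subset)), 2)
    return float(np.mean([abs(corr[i, j]) for i, j in pairs]))


def augment_by_shuffling(subset, n_copies, seed=0):
    """Return n_copies random orderings of the same feature-ID sequence.

    Every ordering denotes the same feature subset, so it can share the
    performance and redundancy labels of the original sample.
    """
    rng = random.Random(seed)
    augmented = []
    for _ in range(n_copies):
        perm = list(subset)
        rng.shuffle(perm)
        augmented.append(perm)
    return augmented


if __name__ == "__main__":
    data_rng = np.random.default_rng(0)
    base = data_rng.normal(size=(200, 1))
    # Columns 0 and 1 are perfectly correlated; columns 2 and 3 are independent noise.
    X = np.hstack([base, 2.0 * base, data_rng.normal(size=(200, 2))])
    print("redundant pair:", subset_redundancy(X, [0, 1]))
    print("independent pair:", subset_redundancy(X, [2, 3]))
    print("shuffled copies:", augment_by_shuffling([3, 0, 2], n_copies=3))
```

Under these assumptions, a subset of two linearly dependent features scores close to 1, while a subset of independent features scores close to 0, which is the signal a redundancy-aware search would try to reduce.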