ELF-Gym: Evaluating Large Language Models Generated Features for Tabular Prediction

Yanlin Zhang, Ning Li, Quan Gan, Weinan Zhang, David Wipf, Minjie Wang
2024-10-13
Abstract: Crafting effective features is a crucial yet labor-intensive and domain-specific task within machine learning pipelines. Fortunately, recent advancements in Large Language Models (LLMs) have shown promise in automating various data science tasks, including feature engineering. But despite this potential, evaluations thus far are primarily based on the end performance of a complete ML pipeline, providing limited insight into precisely how LLMs behave relative to human experts in feature engineering. To address this gap, we propose ELF-Gym, a framework for Evaluating LLM-generated Features. We curated a new dataset from historical Kaggle competitions, including 251 "golden" features used by top-performing teams. ELF-Gym then quantitatively evaluates LLM-generated features by measuring their impact on downstream model performance as well as their alignment with expert-crafted features through semantic and functional similarity assessments. This approach provides a more comprehensive evaluation of disparities between LLMs and human experts, while offering valuable insights into specific areas where LLMs may have room for improvement. For example, using ELF-Gym we empirically demonstrate that, in the best-case scenario, LLMs can semantically capture approximately 56% of the golden features, but at the more demanding implementation level this overlap drops to 13%. Moreover, in other cases LLMs may fail completely, particularly on datasets that require complex features, indicating broad potential pathways for improvement.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to evaluate large language models (LLMs) on feature engineering, specifically their ability to generate useful features for tabular prediction tasks. Although LLMs have shown potential for automating data science tasks, current evaluations rely mainly on the end performance of a complete machine learning pipeline, which gives limited insight into how LLMs behave relative to human experts in feature engineering. To fill this gap, the authors propose ELF-Gym, a framework for evaluating LLM-generated features. The work covers the following aspects:

1. **Dataset Construction**: The authors curated a dataset from historical Kaggle competitions that includes 251 "golden" features used by top-performing teams.
2. **Quantitative Evaluation**: LLM-generated features are assessed by measuring their impact on downstream model performance and their semantic and functional similarity to expert-crafted features.
3. **Research Questions**:
   - **RQ1**: Can LLMs discover "golden" features by reasoning about data descriptions and patterns?
   - **RQ2**: Which types of "golden" features do LLMs generate well, and which types do they struggle with?

Through these evaluations, ELF-Gym aims to give a more comprehensive picture of the strengths and weaknesses of LLMs in feature engineering, and to point to specific areas where LLMs have room for improvement. For example, the study shows that, in the best-case scenario, LLMs semantically capture about 56% of the "golden" features, but this overlap drops to 13% at the more demanding implementation level. In other cases LLMs may fail completely, particularly on datasets that require complex features. These findings highlight the importance of standardized benchmarks and robust evaluation tools.
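To make the idea of implementation-level (functional) overlap concrete, here is a minimal sketch of one way such a score could be computed. It is not ELF-Gym's actual procedure: the abstract does not say how functional similarity is measured. The assumption here is that a golden feature counts as recovered when some generated feature's values correlate strongly with it. The function names, the 0.9 threshold, and the toy feature columns are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Returns 0.0 when either sequence is constant (correlation undefined).
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def golden_recall(golden, generated, threshold=0.9):
    """Fraction of golden features matched by at least one generated feature.

    Assumption (not from the paper): two features "match" when the absolute
    Pearson correlation of their values meets the threshold.
    """
    if not golden:
        raise ValueError("golden feature set must not be empty")
    matched = [
        name
        for name, values in golden.items()
        if any(abs(pearson(values, cand)) >= threshold
               for cand in generated.values())
    ]
    return len(matched) / len(golden), matched


# Hypothetical toy columns: each list holds one feature's value per row.
golden_features = {
    "price_per_unit": [2.0, 4.0, 6.0, 8.0],
    "days_since_last_order": [1.0, 5.0, 2.0, 9.0],
}
llm_features = {
    "unit_price_scaled": [1.0, 2.0, 3.0, 4.0],  # linear rescaling of price_per_unit
    "order_count": [3.0, 3.0, 1.0, 2.0],
}

recall, matched = golden_recall(golden_features, llm_features)
print(recall, matched)
```

In this toy case only `price_per_unit` is recovered, so the recall is 0.5. A correlation test is a deliberately simple stand-in: it catches rescaled copies of a feature but would miss nonlinear or categorical equivalents, so any real benchmark would likely need a stricter or richer notion of functional equivalence.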