Abstract: Crafting effective features is a crucial yet labor-intensive and domain-specific task within machine learning pipelines. Fortunately, recent advancements in Large Language Models (LLMs) have shown promise in automating various data science tasks, including feature engineering. But despite this potential, evaluations thus far are primarily based on the end performance of a complete ML pipeline, providing limited insight into precisely how LLMs behave relative to human experts in feature engineering. To address this gap, we propose ELF-Gym, a framework for Evaluating LLM-generated Features. We curated a new dataset from historical Kaggle competitions, including 251 "golden" features used by top-performing teams. ELF-Gym then quantitatively evaluates LLM-generated features by measuring their impact on downstream model performance as well as their alignment with expert-crafted features through semantic and functional similarity assessments. This approach provides a more comprehensive evaluation of disparities between LLMs and human experts, while offering valuable insights into specific areas where LLMs may have room for improvement. For example, using ELF-Gym we empirically demonstrate that, in the best-case scenario, LLMs can semantically capture approximately 56% of the golden features, but at the more demanding implementation level this overlap drops to 13%. Moreover, in other cases LLMs may fail completely, particularly on datasets that require complex features, indicating broad potential pathways for improvement.

What problem does this paper attempt to address?

This paper addresses how to evaluate large language models (LLMs) on feature engineering, specifically their ability to generate useful features for tabular prediction tasks. Although LLMs have shown potential in automating various data science tasks, current evaluations rely mainly on the final performance of the entire machine learning pipeline. That end-to-end view reveals little about how LLMs actually behave in feature engineering or where they fall short of human experts.

To fill this gap, the authors propose ELF-Gym, a framework for evaluating LLM-generated features. The framework includes the following aspects:

1. **Dataset Construction**: The authors collected 251 "golden" features used by top-performing teams in historical Kaggle competitions; these serve as the high-quality reference set.
2. **Quantitative Evaluation**: The quality of LLM-generated features is assessed by measuring their impact on downstream model performance and their semantic and functional similarity to expert-crafted features.
3. **Research Questions**:
   - **RQ1**: Can LLMs discover "golden" features by reasoning about data descriptions and patterns?
   - **RQ2**: Which types of "golden" features do LLMs generate well, and which types do they struggle with?

Through these evaluations, ELF-Gym aims to give a more comprehensive picture of the strengths and weaknesses of LLMs in feature engineering, offering insights for improving how LLMs are applied to data science tasks. For example, the study shows that, in the best-case scenario, LLMs semantically capture about 56% of the "golden" features, but this proportion drops to 13% at the implementation level. In other cases LLMs may fail completely, particularly on datasets that require complex features. These findings highlight the importance of standardized benchmarks and robust evaluation tools.
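To make the idea of implementation-level (functional) overlap concrete, here is a minimal sketch of how the share of golden features recovered by generated features could be computed. The summary above does not say how ELF-Gym measures functional similarity, so the matching rule used here (absolute Pearson correlation above a threshold) is an assumption, and the names `pearson`, `golden_recall`, the `threshold` value, and the toy columns are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Returns 0.0 when either sequence is constant (correlation undefined).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def golden_recall(golden, generated, threshold=0.9):
    """Fraction of golden features matched by at least one generated feature.

    ASSUMPTION: a golden feature counts as "matched" when some generated
    feature has |Pearson r| >= threshold with it. This is an illustrative
    proxy, not the metric defined by the paper.
    """
    if not golden:
        return 0.0
    matched = sum(
        1
        for golden_values in golden.values()
        if any(
            abs(pearson(golden_values, candidate)) >= threshold
            for candidate in generated.values()
        )
    )
    return matched / len(golden)


# Toy columns (hypothetical): each list holds one feature's values per row.
golden = {
    "total_spend": [10.0, 25.0, 5.0, 40.0],
    "days_since_last_order": [3.0, 30.0, 1.0, 12.0],
}
generated = {
    "spend_in_cents": [1000.0, 2500.0, 500.0, 4000.0],  # rescaled total_spend
    "order_count": [2.0, 1.0, 4.0, 3.0],
}

print(golden_recall(golden, generated))  # 0.5: only total_spend is recovered
```

A correlation-based rule treats rescaled or sign-flipped copies of a feature as equivalent, which suits tree-based downstream models reasonably well, but it would miss nonlinear transformations; a rank-based correlation is one possible alternative under the same assumptions.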

ELF-Gym: Evaluating Large Language Models Generated Features for Tabular Prediction
