Abstract: We present a benchmark for large language models designed to tackle one of the most knowledge-intensive tasks in data science: writing feature engineering code, which requires domain knowledge in addition to a deep understanding of the underlying problem and data structure. The model is provided with a dataset description in a prompt and asked to generate code transforming it. The evaluation score is derived from the improvement achieved by an XGBoost model fit on the modified dataset compared to the original data. Through an extensive evaluation of state-of-the-art models and a comparison with well-established benchmarks, we demonstrate that our proposed FeatEng benchmark can cheaply and efficiently assess the broad capabilities of LLMs, in contrast to existing methods.

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

The paper aims to address the limitations of existing large language model (LLM) benchmarks, particularly their inadequacies in evaluating the practical usability of models, the application of world knowledge, the integration of complex skills, and resistance to exploitation strategies. Specifically, the paper proposes a new benchmark called FeatEng to assess LLM performance on feature engineering tasks. Feature engineering is a knowledge-intensive data science task that requires domain knowledge and a deep understanding of the problem and the data structure.

### Main Issues

1. **Practical Usability**: Existing benchmarks often focus on specific aspects such as language understanding, world knowledge, code generation, or mathematical reasoning, but do not evaluate how these skills are applied together to real-world problems.
2. **Application of World Knowledge**: Many benchmarks only verify the breadth of a model's knowledge without assessing its ability to apply that knowledge dynamically and effectively in different functional contexts.
3. **Integration of Complex Skills**: Current benchmarks typically evaluate different capabilities of models in isolation, failing to assess their ability to integrate multiple skills in multifaceted tasks.
4. **Resistance to Exploitation Strategies**: Many benchmarks are susceptible to exploitation through pattern recognition or memorization, so score improvements do not necessarily reflect genuine gains in model capability.

### Solution

The paper addresses these issues by proposing the FeatEng benchmark. The benchmark requires models to generate feature engineering code from a given dataset description; the score reflects how much an XGBoost model fit on the modified dataset improves over one fit on the original data. Specifically:

- **Practical Usability**: The tasks are based on real-world problems, ensuring that model improvements translate into practical benefits.
- **Application of World Knowledge**: Models need to creatively apply domain knowledge to generate features that improve predictive performance.
- **Integration of Complex Skills**: The tasks require a combination of data interpretation, reasoning, code generation, and instruction-following skills.
- **Resistance to Exploitation Strategies**: By requiring unique solutions for diverse datasets, the benchmark reduces the likelihood of exploitation through memorization or shallow strategies.

### Experimental Results

The paper extensively evaluates multiple LLMs and compares the results with existing benchmarks. The results show that certain models (such as o1-preview) perform exceptionally well on the FeatEng benchmark, significantly improving the effectiveness of the engineered features. This indicates that the FeatEng benchmark can effectively assess the comprehensive capabilities of models in real-world data science tasks.

### Conclusion

By introducing the FeatEng benchmark, the paper fills a gap in existing benchmarks regarding the evaluation of LLMs' practical application capabilities, providing researchers and developers with a more comprehensive and reliable assessment tool.
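
The evaluation loop described under "Solution" can be illustrated with a minimal sketch. It is not the paper's implementation: to stay dependency-free, a one-feature decision stump stands in for XGBoost, the data is a toy height/weight grid, and the score is assumed to be the relative reduction in test error. All names (`fit_stump`, `feateng_style_score`, `add_bmi`) are hypothetical; `add_bmi` plays the role of the feature engineering code an LLM would generate.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]


def fit_stump(rows: List[Row], labels: List[int]) -> Callable[[Row], int]:
    """Fit the best single-feature threshold rule (a toy stand-in for XGBoost)."""
    best = None  # (training errors, feature, threshold, label predicted above it)
    for feat in rows[0]:
        for thr in sorted({r[feat] for r in rows}):
            for above in (1, 0):
                preds = [above if r[feat] > thr else 1 - above for r in rows]
                errors = sum(p != y for p, y in zip(preds, labels))
                if best is None or errors < best[0]:
                    best = (errors, feat, thr, above)
    _, feat, thr, above = best
    return lambda r: above if r[feat] > thr else 1 - above


def error_rate(train: List[Row], y_train: List[int],
               test: List[Row], y_test: List[int]) -> float:
    """Fit on the training split and return the error rate on the test split."""
    model = fit_stump(train, y_train)
    return sum(model(r) != y for r, y in zip(test, y_test)) / len(test)


def feateng_style_score(transform: Callable[[Row], Row],
                        train: List[Row], y_train: List[int],
                        test: List[Row], y_test: List[int]) -> float:
    """Relative error reduction from the transform (assumed scoring formula)."""
    base = error_rate(train, y_train, test, y_test)
    new = error_rate([transform(r) for r in train], y_train,
                     [transform(r) for r in test], y_test)
    return 0.0 if base == 0 else (base - new) / base


# Toy dataset: the label depends on a ratio that no single raw column exposes.
rows = [{"height_m": h, "weight_kg": float(w)}
        for h in (1.5, 1.6, 1.7, 1.8, 1.9)
        for w in (50, 60, 70, 80, 90, 100)]
labels = [1 if r["weight_kg"] / r["height_m"] ** 2 > 25 else 0 for r in rows]
train, y_train = rows[0::2], labels[0::2]
test, y_test = rows[1::2], labels[1::2]


def add_bmi(row: Row) -> Row:
    """Example of domain knowledge expressed as an engineered feature."""
    return {**row, "bmi": row["weight_kg"] / row["height_m"] ** 2}


score = feateng_style_score(add_bmi, train, y_train, test, y_test)
identity_score = feateng_style_score(lambda r: dict(r), train, y_train, test, y_test)
print(f"score with BMI feature: {score:.2f}, with no new features: {identity_score:.2f}")
```

In this toy setup the stump cannot separate the classes using height or weight alone, so a transform that encodes the right domain knowledge earns a positive score, while a transform that adds nothing scores zero. A real harness would swap the stump for an XGBoost model and the dictionaries for a tabular data frame.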

Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists

LLM4DS: Evaluating Large Language Models for Data Science Code Generation

What is the best model? Application-driven Evaluation for Large Language Models

ELF-Gym: Evaluating Large Language Models Generated Features for Tabular Prediction

Large Linguistic Models: Analyzing theoretical linguistic abilities of LLMs

Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate

LLMEval: A Preliminary Study on How to Evaluate Large Language Models

Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks

SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading

Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence

SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research

Evaluating Language Models for Generating and Judging Programming Feedback

Are Large Language Models Good Statisticians?

Through the Lens of Core Competency: Survey on Evaluation of Large Language Models

A Survey on Evaluation of Large Language Models

Enterprise Benchmarks for Large Language Model Evaluation

MLLM-DataEngine: An Iterative Refinement Approach for MLLM

A Survey on Evaluation of Large Language Models (Just Accepted)

An Interdisciplinary Outlook on Large Language Models for Scientific Research