Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists

Michał Pietruszka, Łukasz Borchmann, Aleksander Jędrosz, Paweł Morawiecki
2024-10-31
Abstract: We present a benchmark for large language models designed to tackle one of the most knowledge-intensive tasks in data science: writing feature engineering code, which requires domain knowledge in addition to a deep understanding of the underlying problem and data structure. The model is provided with a dataset description in a prompt and asked to generate code transforming it. The evaluation score is derived from the improvement achieved by an XGBoost model fit on the modified dataset compared to the original data. By an extensive evaluation of state-of-the-art models and comparison to well-established benchmarks, we demonstrate that the FeatEng of our proposal can cheaply and efficiently assess the broad capabilities of LLMs, in contrast to the existing methods.
Computation and Language
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

The paper addresses limitations of existing large language model (LLM) benchmarks: they do a poor job of evaluating practical usability, the application of world knowledge, the integration of complex skills, and resistance to exploitation strategies. To that end, it proposes a new benchmark, FeatEng, which assesses LLM performance on feature engineering. Feature engineering is a knowledge-intensive data science task that requires domain knowledge and a deep understanding of both the problem and the structure of the data.

### Main Issues

1. **Practical Usability**: Existing benchmarks often focus on a single aspect such as language understanding, world knowledge, code generation, or mathematical reasoning, and do not evaluate how these skills are applied together to real-world problems.
2. **Application of World Knowledge**: Many benchmarks only verify the breadth of a model's knowledge without assessing its ability to apply that knowledge dynamically and effectively in different functional contexts.
3. **Integration of Complex Skills**: Current benchmarks typically evaluate each capability in isolation, so they do not measure whether a model can combine multiple skills in a multifaceted task.
4. **Resistance to Exploitation Strategies**: Many benchmarks can be gamed through pattern recognition or memorization, so a higher score does not necessarily reflect a more capable model.

### Solution

The paper addresses these issues with the FeatEng benchmark. Given a dataset description, the model must generate feature engineering code; the score reflects how much an XGBoost model fit on the transformed dataset improves over one fit on the original data. Specifically:

- **Practical Usability**: The tasks are based on real-world problems, ensuring that model improvements translate into practical benefits.
- **Application of World Knowledge**: Models need to creatively apply domain knowledge to generate features that improve predictive performance.
- **Integration of Complex Skills**: The tasks require a combination of data interpretation, reasoning, code generation, and instruction following.
- **Resistance to Exploitation Strategies**: By requiring unique solutions for diverse datasets, the benchmark reduces the likelihood of exploitation through memorization or shallow strategies.

### Experimental Results

The paper evaluates multiple LLMs and compares their FeatEng scores with their results on existing benchmarks. Some models, such as o1-preview, perform especially well on FeatEng, producing features that noticeably improve the downstream model. The authors take this as evidence that FeatEng can assess the combined capabilities of models on real-world data science tasks.

### Conclusion

By introducing the FeatEng benchmark, the paper fills a gap in existing benchmarks regarding the evaluation of LLMs' practical application capabilities, giving researchers and developers a more comprehensive and reliable assessment tool.
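The scoring idea from the Solution section (fit a model on the original features, fit it again on the transformed features, and compare) can be illustrated with a minimal sketch. Several things here are assumptions rather than details taken from the paper: the summary does not give the exact scoring formula, so relative error reduction is used as one plausible choice; a one-feature threshold rule (a decision stump) stands in for XGBoost so the example runs with the standard library alone; and the height/weight data, the `add_bmi` transform, and all function names are invented for illustration.

```python
from typing import Callable, List, Tuple

Row = List[float]
# (feature index, threshold, label predicted when the value is >= threshold)
Stump = Tuple[int, float, int]


def predict(rule: Stump, row: Row) -> int:
    j, threshold, above = rule
    return above if row[j] >= threshold else 1 - above


def fit_stump(rows: List[Row], labels: List[int]) -> Stump:
    """Stand-in for XGBoost: best single-feature threshold rule on the training set."""
    best_mistakes, best_rule = len(rows) + 1, (0, 0.0, 1)
    for j in range(len(rows[0])):
        for threshold in sorted({r[j] for r in rows}):
            for above in (1, 0):
                rule = (j, threshold, above)
                mistakes = sum(predict(rule, r) != y for r, y in zip(rows, labels))
                if mistakes < best_mistakes:  # strict: the first best rule wins ties
                    best_mistakes, best_rule = mistakes, rule
    return best_rule


def holdout_error(train_x, train_y, test_x, test_y,
                  transform: Callable[[Row], Row]) -> float:
    """Fit on transformed training rows, return the error rate on transformed test rows."""
    rule = fit_stump([transform(r) for r in train_x], train_y)
    wrong = sum(predict(rule, transform(r)) != y for r, y in zip(test_x, test_y))
    return wrong / len(test_y)


def error_reduction(baseline: float, engineered: float) -> float:
    """Assumed score: fraction of the baseline error removed by the new features."""
    return 0.0 if baseline == 0 else (baseline - engineered) / baseline


# Invented toy data: [height in m, weight in kg]; the label depends on their ratio.
TRAIN_X = [[1.60, 50.0], [1.60, 65.0], [1.80, 70.0], [1.80, 90.0],
           [1.70, 60.0], [1.70, 80.0], [1.90, 85.0], [1.50, 60.0]]
TRAIN_Y = [0, 1, 0, 1, 0, 1, 0, 1]
TEST_X = [[1.55, 62.0], [1.95, 88.0], [1.65, 55.0], [1.75, 85.0]]
TEST_Y = [1, 0, 0, 1]


def identity(row: Row) -> Row:
    return list(row)


def add_bmi(row: Row) -> Row:
    """The kind of feature an LLM might propose from domain knowledge."""
    height_m, weight_kg = row
    return [height_m, weight_kg, weight_kg / height_m ** 2]


baseline = holdout_error(TRAIN_X, TRAIN_Y, TEST_X, TEST_Y, identity)
engineered = holdout_error(TRAIN_X, TRAIN_Y, TEST_X, TEST_Y, add_bmi)
score = error_reduction(baseline, engineered)
print(f"baseline error={baseline}, engineered error={engineered}, score={score}")
```

Neither raw column separates the classes with a single threshold, while the engineered ratio does, which is why the added feature lowers the held-out error. In the benchmark itself the model would be XGBoost and the transform would be LLM-generated code applied to the full dataset; the comparison structure is the part this sketch is meant to convey.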