DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models

Yiming Huang, Jianwen Luo, Yan Yu, Yitong Zhang, Fangyu Lei, Yifan Wei, Shizhu He, Lifu Huang, Xiao Liu, Jun Zhao, Kang Liu
2024-10-11
Abstract: We introduce DA-Code, a code generation benchmark specifically designed to assess LLMs on agent-based data science tasks. This benchmark features three core elements: First, the tasks within DA-Code are inherently challenging, setting them apart from traditional code generation tasks and demanding advanced coding skills in grounding and planning. Second, examples in DA-Code are all based on real and diverse data, covering a wide range of complex data wrangling and analytics tasks. Third, to solve the tasks, the models must utilize complex data science programming languages, to perform intricate data processing and derive the answers. We set up the benchmark in a controllable and executable environment that aligns with real-world data analysis scenarios and is scalable. The annotators meticulously design the evaluation suite to ensure the accuracy and robustness of the evaluation. We develop the DA-Agent baseline. Experiments show that although the baseline performs better than other existing frameworks, using the current best LLMs achieves only 30.5% accuracy, leaving ample room for improvement. We release our benchmark at https://da-code-bench.github.io.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper aims to address the issue of evaluating the capabilities of large language models (LLMs) in handling complex data science tasks. Specifically, the paper proposes a benchmark named **DA-Code**, designed to assess the performance of LLMs in agent-based data science tasks.

#### Main Issues:

1. **Complex Task Challenges**: Existing code generation tasks are usually simpler, whereas the tasks in the DA-Code benchmark are more challenging, requiring advanced programming skills and planning abilities.
2. **Diversity of Real Data**: Data sources in existing benchmarks are often singular and not realistic, while DA-Code uses real and diverse data, covering a wide range of data processing and analysis tasks.
3. **Complex Programming Languages**: Tasks in existing benchmarks typically require only simple programming languages, whereas DA-Code requires models to use complex programming languages (such as Python, SQL, and Bash) to complete tasks.
4. **Autonomous Decision-Making Ability**: Existing models lack autonomous decision-making and planning abilities when handling complex data science tasks, and DA-Code aims to evaluate models in these aspects.

### Specific Goals:

- **Design Complex Tasks**: Create a series of challenging data science tasks, covering data cleaning, machine learning, and exploratory data analysis.
- **Use Real Data**: Ensure all tasks are based on real-world data, enhancing the authenticity and complexity of the tasks.
- **Evaluate Complex Programming Abilities**: Require models to use multiple programming languages for complex data processing and analysis.
- **Build a Controllable Environment**: Provide a controllable and executable environment that simulates real-world data analysis scenarios and is scalable.
- **Develop an Evaluation System**: Design a detailed evaluation system to ensure the accuracy and robustness of the assessments.

Through these goals, DA-Code hopes to advance the application of LLMs in the field of data science, enabling them to better handle complex tasks in the real world.
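
To make the idea of agent-based evaluation in an executable environment more concrete, here is a minimal sketch of how such a loop could be structured. It is not DA-Code's actual harness: the names `Task`, `run_action`, `evaluate`, and `scripted_policy` are hypothetical, the action format (a language tag plus source text) is an assumption, and exact string match stands in for whatever scoring the benchmark's evaluation suite really uses.

```python
import sqlite3
import subprocess
import sys
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical action format: (language, payload). The tag "answer" ends the episode.
Action = Tuple[str, str]
Policy = Callable[[str, List[str]], Action]


@dataclass
class Task:
    instruction: str
    expected: str  # reference answer; compared by exact match (an assumption)


def run_action(action: Action, db: sqlite3.Connection) -> str:
    """Execute one Python, SQL, or Bash action and return a text observation."""
    lang, code = action
    if lang == "python":
        result = subprocess.run([sys.executable, "-c", code],
                                capture_output=True, text=True, timeout=30)
        return (result.stdout + result.stderr).strip()
    if lang == "bash":
        result = subprocess.run(["bash", "-c", code],
                                capture_output=True, text=True, timeout=30)
        return (result.stdout + result.stderr).strip()
    if lang == "sql":
        rows = db.execute(code).fetchall()
        return "\n".join(",".join(str(v) for v in row) for row in rows)
    raise ValueError(f"unsupported language: {lang}")


def evaluate(tasks: List[Task], policy: Policy, max_steps: int = 5) -> float:
    """Run the policy on every task and return the fraction answered correctly."""
    correct = 0
    for task in tasks:
        db = sqlite3.connect(":memory:")  # fresh, isolated database per task
        observations: List[str] = []
        answer = ""
        for _ in range(max_steps):
            lang, payload = policy(task.instruction, observations)
            if lang == "answer":
                answer = payload
                break
            observations.append(run_action((lang, payload), db))
        db.close()
        correct += int(answer.strip() == task.expected.strip())
    return correct / len(tasks) if tasks else 0.0


def scripted_policy(instruction: str, observations: List[str]) -> Action:
    """Stand-in for an LLM: issue one SQL query, then report its result."""
    if not observations:
        return ("sql", "SELECT 2 + 3")
    return ("answer", observations[-1])
```

In a real setting the policy would be an LLM that chooses each action from the instruction and the observations so far, and the subprocess calls would run inside an isolated sandbox rather than on the host machine.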