Abstract: We introduce DA-Code, a code generation benchmark specifically designed to assess LLMs on agent-based data science tasks. This benchmark features three core elements: First, the tasks within DA-Code are inherently challenging, setting them apart from traditional code generation tasks and demanding advanced coding skills in grounding and planning. Second, examples in DA-Code are all based on real and diverse data, covering a wide range of complex data wrangling and analytics tasks. Third, to solve the tasks, the models must utilize complex data science programming languages to perform intricate data processing and derive the answers. We set up the benchmark in a controllable and executable environment that aligns with real-world data analysis scenarios and is scalable. The annotators meticulously design the evaluation suite to ensure the accuracy and robustness of the evaluation. We develop the DA-Agent baseline. Experiments show that although the baseline performs better than other existing frameworks, using the current best LLMs achieves only 30.5% accuracy, leaving ample room for improvement. We release our benchmark at https://da-code-bench.github.io.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address the issue of evaluating the capabilities of large language models (LLMs) in handling complex data science tasks. Specifically, the paper proposes a benchmark named **DA-Code**, designed to assess the performance of LLMs in agent-based data science tasks.

#### Main Issues:

1. **Complex Task Challenges**: Existing code generation tasks are usually simpler, whereas the tasks in the DA-Code benchmark are more challenging, requiring advanced programming skills and planning abilities.
2. **Diversity of Real Data**: Data sources in existing benchmarks are often singular and not realistic, while DA-Code uses real and diverse data, covering a wide range of data processing and analysis tasks.
3. **Complex Programming Languages**: Tasks in existing benchmarks typically require only simple programming languages, whereas DA-Code requires models to use complex programming languages (such as Python, SQL, and Bash) to complete tasks.
4. **Autonomous Decision-Making Ability**: Existing models lack autonomous decision-making and planning abilities when handling complex data science tasks, and DA-Code aims to evaluate models in these aspects.

### Specific Goals:

- **Design Complex Tasks**: Create a series of challenging data science tasks, covering data cleaning, machine learning, and exploratory data analysis.
- **Use Real Data**: Ensure all tasks are based on real-world data, enhancing the authenticity and complexity of the tasks.
- **Evaluate Complex Programming Abilities**: Require models to use multiple programming languages for complex data processing and analysis.
- **Build a Controllable Environment**: Provide a controllable and executable environment that simulates real-world data analysis scenarios and is scalable.
- **Develop an Evaluation System**: Design a detailed evaluation system to ensure the accuracy and robustness of the assessments.

Through these goals, DA-Code hopes to advance the application of LLMs in the field of data science, enabling them to better handle complex tasks in the real world.
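
The summary describes an agent that issues code actions in an executable environment and is scored on its final answers. That interaction pattern might be sketched as follows. This is a minimal illustration under stated assumptions, not code from DA-Code or DA-Agent: `Action`, `ToyEnvironment`, `run_episode`, and `accuracy` are hypothetical names, Bash actions are left out for portability, and exact-match scoring is an assumption rather than the benchmark's actual evaluation suite.

```python
import contextlib
import io
import sqlite3
from dataclasses import dataclass


@dataclass
class Action:
    """One agent step: a language tag plus code to run (hypothetical schema)."""
    kind: str  # "python" or "sql" in this toy example
    code: str


class ToyEnvironment:
    """Runs actions and returns text observations. Not the DA-Code sandbox."""

    def __init__(self) -> None:
        self.db = sqlite3.connect(":memory:")  # in-memory database for SQL actions
        self.py_globals: dict = {}             # state shared across Python actions

    def step(self, action: Action) -> str:
        try:
            if action.kind == "python":
                buffer = io.StringIO()
                # Capture printed output so it can be returned as the observation.
                with contextlib.redirect_stdout(buffer):
                    exec(action.code, self.py_globals)
                return buffer.getvalue().strip()
            if action.kind == "sql":
                rows = self.db.execute(action.code).fetchall()
                self.db.commit()
                return repr(rows)
            return f"error: unsupported action kind {action.kind!r}"
        except Exception as exc:
            # Failures are returned as observations so an agent could react to them.
            return f"error: {exc}"


def run_episode(env: ToyEnvironment, actions: list[Action]) -> list[str]:
    """Execute a fixed list of actions and collect the observations."""
    return [env.step(action) for action in actions]


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of tasks whose predicted answer exactly matches the reference."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)
```

In a real agent, the list of actions would be produced step by step by an LLM that reads each observation before choosing the next action; the fixed list here only keeps the sketch deterministic.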

DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models
