PyBench: Evaluating LLM Agent on various real-world coding tasks

Yaolun Zhang,Yinxu Pan,Yudong Wang,Jie Cai

2024-08-03

Abstract:The LLM Agent, equipped with a code interpreter, is capable of automatically solving real-world coding tasks, such as data analysis and image editing. However, existing benchmarks primarily focus on either simplistic tasks, such as completing a few lines of code, or on extremely complex and specific tasks at the repository level, neither of which are representative of various daily coding tasks. To address this gap, we introduce \textbf{PyBench}, a benchmark encompassing five main categories of real-world tasks, covering more than 10 types of files. Given a high-level user query and related files, the LLM Agent needs to reason and execute Python code via a code interpreter for a few turns before making a formal response to fulfill the user's requirements. Successfully addressing tasks in PyBench demands a robust understanding of various Python packages, superior reasoning capabilities, and the ability to incorporate feedback from executed code. Our evaluations indicate that current open-source LLMs are struggling with these tasks. Hence, we conduct analysis and experiments on four kinds of datasets proving that comprehensive abilities are needed for PyBench. Our fine-tuned 8B size model: \textbf{PyLlama3} achieves an exciting performance on PyBench which surpasses many 33B and 70B size models. Our Benchmark, Training Dataset, and Model are available at: {<a class="link-external link-https" href="https://github.com/Mercury7353/PyBench" rel="external noopener nofollow">this https URL</a>}

Software Engineering,Artificial Intelligence

What problem does this paper attempt to address?

The main aim of this paper is to address the following issues: 1. **Proposing the PyBench Benchmark**: Current large language models (LLMs) are primarily evaluated on simple tasks or extremely complex repository-level tasks in terms of coding capabilities, lacking a comprehensive evaluation standard for everyday practical coding tasks. Therefore, the authors propose the PyBench benchmark, which covers five major categories of real-world tasks, including data preprocessing, text analysis, image and audio editing, complex mathematical operations, and software and website development, to evaluate the practical coding capabilities of LLMs in these areas. 2. **Evaluating Existing Model Performance**: Using the PyBench benchmark, the authors evaluated existing open-source LLMs and proprietary models, finding that most models perform poorly in solving these real-world coding tasks. This indicates that although some models perform well on basic programming tasks, they have limitations when dealing with more complex and variable real-world application scenarios. 3. **Exploring Improvement Strategies**: To enhance the ability of LLMs in solving practical coding tasks, the paper also explores the impact of different training datasets on model performance. Specifically, through continuous pre-training and fine-tuning with homologous datasets, multi-round code interaction datasets, multi-round dialogue datasets, and code-rich corpora, experimental results show that these strategies can significantly improve the model's performance on the PyBench benchmark. In summary, the goal of the paper is to fill the gaps in the existing LLM evaluation system by proposing a more comprehensive and practical evaluation benchmark and exploring how to enhance the ability of LLMs to solve real-world coding tasks through specific datasets and training methods.

PyBench: Evaluating LLM Agent on various real-world coding tasks

ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code

AgentBench: Evaluating LLMs as Agents

ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

CIBench: Evaluating Your LLMs with a Code Interpreter Plugin

CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges

TaskBench: Benchmarking Large Language Models for Task Automation

BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems

BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data

Benchmarking Llama 3 70B for Code Generation: A Comprehensive Evaluation

NLPBench: Evaluating Large Language Models on Solving NLP Problems

WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild

TestBench: Evaluating Class-Level Test Case Generation Capability of Large Language Models

CodeEditorBench: Evaluating Code Editing Capability of Large Language Models

AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents

LawBench: Benchmarking Legal Knowledge of Large Language Models

MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation

CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery

InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models

BENCHAGENTS: Automated Benchmark Creation with Agent Interaction