Abstract: Task automation has been greatly empowered by recent advances in Large Language Models (LLMs) via Python code, with tasks ranging from software engineering development to general-purpose reasoning. While current benchmarks have shown that LLMs can solve tasks using programs like human developers, the majority of their evaluations are limited to short and self-contained algorithmic tasks or standalone function calls. Solving challenging and practical tasks requires the capability of utilizing diverse function calls as tools to efficiently implement functionalities like data analysis and web development. In addition, using multiple tools to solve a task requires compositional reasoning, grounded in an accurate understanding of complex instructions. Fulfilling both of these characteristics can pose a great challenge for LLMs. To assess how well LLMs can solve challenging and practical tasks via programs, we introduce BigCodeBench, a benchmark that challenges LLMs to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained tasks. To evaluate LLMs rigorously, each task comes with an average of 5.6 test cases and an average branch coverage of 99%. In addition, we propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that automatically transforms the original docstrings into short instructions containing only essential information. Our extensive evaluation of 60 LLMs shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%. The results underscore the need for further advancements in this area.

What problem does this paper attempt to address?

This paper addresses how to evaluate the ability of large language models (LLMs) to solve complex, practical programming tasks. Current benchmarks mainly focus on short, self-contained algorithmic tasks or standalone function calls, and overlook two characteristics of real-world programming:

1. **Diverse function calls as tools**: Complex programming tasks usually require calling a sequence of functions from different libraries to implement functionality such as data analysis or web development.
2. **Understanding and execution of complex instructions**: Real-world programming tasks often come with complex instructions, which require compositional reasoning to combine a series of operations in the correct order (for example, input data processing, error message handling, and specific output formatting).

To address these gaps, the paper introduces a new benchmark, **BigCodeBench**, which challenges LLMs to invoke multiple function calls across 1,140 fine-grained tasks drawn from 139 libraries and 7 domains. Each task has an average of 5.6 test cases and an average branch coverage of 99%. The paper also proposes a natural-language-oriented variant, **BigCodeBench-Instruct**, which automatically converts the original docstrings into short instructions containing only the essential information.

Through an extensive evaluation of 60 LLMs, the study finds that even the most advanced LLMs (such as GPT-4) still perform poorly on these complex tasks, with a maximum score of only 60%, far below the human performance of 97%. This indicates that LLMs still need to improve at understanding and executing complex instructions.
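To make the two characteristics described above more concrete, here is a hypothetical task written in the spirit of that description. It is not taken from BigCodeBench: the function name `task_func`, its parameters, and the test cases are assumptions made for illustration. The sketch combines calls from several standard-library modules (`csv`, `io`, `collections`, `statistics`, `json`) and follows a multi-part docstring covering input processing, error handling, and output formatting, with unit tests exercising each branch.

```python
import csv
import io
import json
import statistics
import unittest
from collections import defaultdict


def task_func(csv_text: str, group_col: str, value_col: str) -> str:
    """Group the rows of a CSV string by `group_col` and return a JSON
    string mapping each group to the mean of `value_col`.

    Raises:
        ValueError: if a required column is missing or a value is not numeric.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    for col in (group_col, value_col):
        if col not in fieldnames:
            raise ValueError(f"missing column: {col}")

    # Collect the numeric values of each group.
    groups = defaultdict(list)
    for row in reader:
        try:
            groups[row[group_col]].append(float(row[value_col]))
        except (TypeError, ValueError) as exc:
            raise ValueError(f"non-numeric value in {value_col!r}") from exc

    # Format the output as JSON with sorted keys.
    means = {name: statistics.mean(values) for name, values in groups.items()}
    return json.dumps(means, sort_keys=True)


class TestTaskFunc(unittest.TestCase):
    """Hypothetical test cases, one per branch of task_func."""

    def test_group_means(self):
        data = "team,score\na,1\na,3\nb,10\n"
        result = json.loads(task_func(data, "team", "score"))
        self.assertEqual(result, {"a": 2.0, "b": 10.0})

    def test_missing_column(self):
        with self.assertRaises(ValueError):
            task_func("team,score\na,1\n", "team", "points")

    def test_non_numeric_value(self):
        with self.assertRaises(ValueError):
            task_func("team,score\na,oops\n", "team", "score")
```

Saved to a file, the tests can be run with `python -m unittest <file>`. A generated solution counts as correct only if every test passes, which is why high branch coverage in the tests matters for a rigorous evaluation.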
### Formula Summary

The paper does not rely on specific mathematical formulas, but it does use some technical metrics and statistics, such as:

- **Pass@K**: Used to evaluate the functional correctness of generated code snippets.
- **Cyclomatic Complexity**: A measure of code complexity, corresponding to the number of linearly independent paths through a task solution.

### Key Problem Summary

- **Problem Description**: Evaluate the capabilities of LLMs in solving complex and practical programming tasks.
- **Solution**: Introduce the BigCodeBench benchmark, covering tasks with diverse function calls and complex instructions.
- **Research Finding**: LLMs have limited performance in solving complex tasks, especially in understanding complex instructions.
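As a note on the Pass@K metric listed in the formula summary: the summary does not say how the metric is computed. A minimal sketch, assuming the widely used unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$ (where `n` samples are generated per task and `c` of them pass all tests), could look like the following. The function names `pass_at_k` and `benchmark_score` are hypothetical and chosen for illustration.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task.

    n: number of samples generated for the task
    c: number of those samples that pass every test case
    k: number of samples the metric is allowed to draw
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a pass.
        return 1.0
    # 1 minus the probability that all k drawn samples fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks, given (n, c) pairs per task."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("no task results given")
    return sum(scores) / len(scores)
```

With a single sample per task (`n = k = 1`), the estimator reduces to the fraction of tasks whose one generated solution passes all tests.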

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
