BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, David Lo, Binyuan Hui, Niklas Muennighoff, Daniel Fried, Xiaoning Du, Harm de Vries, Leandro Von Werra
2024-10-08
Abstract: Task automation has been greatly empowered by the recent advances in Large Language Models (LLMs) via Python code, where the tasks range from software engineering development to general-purpose reasoning. While current benchmarks have shown that LLMs can solve tasks using programs like human developers, the majority of their evaluations are limited to short and self-contained algorithmic tasks or standalone function calls. Solving challenging and practical tasks requires the capability of utilizing diverse function calls as tools to efficiently implement functionalities like data analysis and web development. In addition, using multiple tools to solve a task needs compositional reasoning by accurately understanding complex instructions. Fulfilling both of these characteristics can pose a great challenge for LLMs. To assess how well LLMs can solve challenging and practical tasks via programs, we introduce BigCodeBench, a benchmark that challenges LLMs to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained tasks. To evaluate LLMs rigorously, each task encompasses 5.6 test cases with an average branch coverage of 99%. In addition, we propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that automatically transforms the original docstrings into short instructions only with essential information. Our extensive evaluation of 60 LLMs shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%. The results underscore the need for further advancements in this area.
Software Engineering, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper aims to evaluate how well large language models (LLMs) solve complex, practical programming tasks. Current benchmarks focus mainly on short, self-contained algorithmic tasks or standalone function calls, and so overlook two characteristics that real-world programming tasks require:

1. **Diverse function calls as tools**: Complex programming tasks usually require calling a sequence of functions from different libraries to implement functionality such as data analysis or web development.
2. **Understanding and executing complex instructions**: Real-world programming tasks often come with complex instructions, which require compositional reasoning so that the model performs a series of steps in the correct order (for example, input data processing, error message handling, and specific output formatting).

To address these gaps, the paper introduces a new benchmark, **BigCodeBench**, which challenges LLMs to invoke multiple function calls across 1,140 fine-grained tasks drawn from 139 libraries and 7 domains. Each task has 5.6 test cases on average, with an average branch coverage of 99%. The paper also proposes a natural-language-oriented variant, **BigCodeBench-Instruct**, which automatically converts the original docstrings into short instructions containing only essential information.

An extensive evaluation of 60 LLMs found that even the most advanced models (such as GPT-4) still perform poorly on these tasks, with scores of at most 60%, far below the human performance of 97%. This indicates that LLMs still need further improvement in understanding and executing complex instructions.
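To make the two characteristics concrete, here is a minimal sketch of what a task in this style might look like. It is a hypothetical illustration, not one of the benchmark's tasks: the function name `summarize_scores`, the CSV layout, and the test cases are all assumptions, and only Python standard-library modules are used so the example stays runnable. It combines several library calls with the instruction elements listed above (input processing, error handling, a specific output format) and pairs the solution with unit tests.

```python
import csv
import io
import json
import statistics
import unittest
from collections import defaultdict


def summarize_scores(csv_text: str) -> str:
    """Return a JSON string mapping each name to its mean score.

    Hypothetical task: parse CSV text with 'name' and 'score' columns,
    average the scores per name (rounded to 2 decimals), and return the
    result as JSON with names in sorted order.

    Raises:
        ValueError: if there are no data rows or a row is malformed.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    scores = defaultdict(list)
    for row in reader:
        try:
            scores[row["name"]].append(float(row["score"]))
        except (KeyError, TypeError, ValueError) as exc:
            # Missing column, missing cell, or non-numeric score.
            raise ValueError(f"malformed row: {row!r}") from exc
    if not scores:
        raise ValueError("no data rows found")
    means = {
        name: round(statistics.mean(values), 2)
        for name, values in sorted(scores.items())
    }
    return json.dumps(means)


class SummarizeScoresTest(unittest.TestCase):
    """Illustrative test cases covering the normal and error branches."""

    def test_means_per_name(self):
        text = "name,score\nalice,90\nbob,80\nalice,70\n"
        self.assertEqual(
            json.loads(summarize_scores(text)), {"alice": 80.0, "bob": 80.0}
        )

    def test_empty_input_raises(self):
        with self.assertRaises(ValueError):
            summarize_scores("name,score\n")

    def test_non_numeric_score_raises(self):
        with self.assertRaises(ValueError):
            summarize_scores("name,score\nalice,abc\n")
```

The tests can be run with `python -m unittest` against the file containing this code. A generated solution would count as correct only if every test case passes, which is why the tests exercise the error branches as well as the normal path.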
### Formula Summary

The paper does not involve specific mathematical formulas, but it mentions some technical metrics and statistics, such as:

- **Pass@K**: Used to evaluate the functional correctness of generated code snippets.
- **Cyclomatic Complexity**: A measure of code complexity, representing the number of independent paths through the task solution.

### Key Problem Summary

- **Problem Description**: Evaluate the capabilities of LLMs in solving complex and practical programming tasks.
- **Solution**: Introduce the BigCodeBench benchmark, covering tasks with diverse function calls and complex instructions.
- **Research Finding**: LLMs have limited performance in solving complex tasks, especially in understanding complex instructions.
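The summary above names Pass@K but does not say how it is computed. As a sketch, the following uses the unbiased estimator that is widely used for code-generation benchmarks: with `n` generated samples for a task, of which `c` pass all tests, pass@k is estimated as 1 - C(n-c, k) / C(n, k). Whether the paper uses exactly this form is an assumption here, and the function names `pass_at_k` and `benchmark_score` are hypothetical.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task.

    n: number of samples generated for the task
    c: number of those samples that pass every test case
    k: number of samples the metric allows (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets made up only of failing samples;
    # it is 0 when fewer than k samples failed, so the estimate becomes 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks, given (n, c) pairs, one pair per task."""
    per_task = [pass_at_k(n, c, k) for n, c in results]
    if not per_task:
        raise ValueError("results must contain at least one task")
    return sum(per_task) / len(per_task)
```

With a single sample per task (`n = k = 1`), the estimate reduces to the fraction of tasks whose one generated solution passes all tests.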