NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts

Shudan Zhang, Hanlin Zhao, Xiao Liu, Qinkai Zheng, Zehan Qi, Xiaotao Gu, Xiaohan Zhang, Yuxiao Dong, Jie Tang
2024-05-08
Abstract: Large language models (LLMs) have shown a strong ability to generate code for productive activities. However, current benchmarks for code synthesis, such as HumanEval, MBPP, and DS-1000, are predominantly oriented towards introductory tasks in algorithms and data science, and do not sufficiently cover the challenging requirements prevalent in real-world coding. To fill this gap, we propose NaturalCodeBench (NCB), a challenging code benchmark designed to mirror the complexity and variety of scenarios in real coding tasks. NCB comprises 402 high-quality problems in Python and Java, meticulously selected from natural user queries from online coding services, covering 6 different domains. Noting the extraordinary difficulty of creating test cases for real-world queries, we also introduce a semi-automated pipeline to enhance the efficiency of test case construction. Compared with manual solutions, it achieves an efficiency increase of more than 4 times. Our systematic experiments on 39 LLMs find that performance gaps on NCB between models with close HumanEval scores can still be significant, indicating a lack of focus on practical code synthesis scenarios or over-specified optimization on HumanEval. On the other hand, even the best-performing GPT-4 is still far from satisfactory on NCB. The evaluation toolkit and development set are available at
Computation and Language, Machine Learning, Software Engineering
What problem does this paper attempt to address?
The main problem this paper addresses is that current code synthesis benchmarks (such as HumanEval and MBPP) mainly focus on introductory tasks in algorithms and data science and do not adequately reflect the complexity and diversity of real-world coding tasks. To this end, the authors propose NaturalCodeBench (NCB), a challenging code benchmark designed to mirror the complexity and variety of actual coding tasks. NCB contains 402 high-quality Python and Java problems, carefully selected from natural user queries to online coding services and covering 6 different domains. Specifically, the paper addresses the following issues:

1. **Limitations of Benchmark Sets**: Existing code synthesis benchmarks (such as HumanEval and MBPP) mainly focus on basic algorithm and data science tasks and do not cover the complex engineering needs of real-world applications. As a result, they fall short when used to evaluate the practical code generation capabilities of large language models (LLMs).
2. **Complexity of Real-World Problems**: Real-world user needs are usually more complex and diverse than the relatively simple problems in existing benchmarks. A benchmark that reflects actual user needs is therefore required.
3. **Efficiency of Test Case Generation**: High-quality test cases are key to evaluating code generation performance, but creating them is time-consuming and labor-intensive. The paper proposes a semi-automated pipeline that makes test case construction more than 4 times as efficient as manual construction.
4. **Differences in Model Performance**: Through systematic experiments on 39 LLMs using NCB, the paper finds that models with similar HumanEval scores can still differ significantly on NCB. This suggests that some existing LLMs either lack focus on practical code synthesis scenarios or are over-optimized for HumanEval.

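To make the role of test cases in point 3 concrete, here is a minimal sketch of test-based scoring in Python. It is not the NCB evaluation toolkit, whose interface is not described here; the names `passes_all_tests` and `pass_rate`, the toy whitespace task, and the two candidate functions are all hypothetical and chosen only for illustration.

```python
from typing import Any, Callable, Iterable, List, Tuple

# Hypothetical representation: each test case is (positional args, expected output).
TestCase = Tuple[tuple, Any]


def passes_all_tests(candidate: Callable[..., Any],
                     test_cases: Iterable[TestCase]) -> bool:
    """Return True only if the candidate matches the expected output on every case."""
    for args, expected in test_cases:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            # A crash on any input counts as a failed solution.
            return False
    return True


def pass_rate(results: List[bool]) -> float:
    """Fraction of candidate solutions that passed all of their tests."""
    return sum(results) / len(results) if results else 0.0


# Toy task (an assumption, not an NCB problem): collapse runs of whitespace.
test_cases: List[TestCase] = [
    (("a  b",), "a b"),
    (("  x ",), "x"),
]


def good_candidate(s: str) -> str:
    return " ".join(s.split())


def bad_candidate(s: str) -> str:
    # Only trims the ends, so internal runs of spaces survive.
    return s.strip()


results = [
    passes_all_tests(good_candidate, test_cases),
    passes_all_tests(bad_candidate, test_cases),
]
score = pass_rate(results)
```

In this sketch a solution is credited only if it passes every test, which is why the quality and coverage of the test cases directly determine how trustworthy the score is.
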
In summary, NaturalCodeBench fills a gap left by existing benchmarks in evaluating practical code generation, and provides tools and data for studying and improving LLM performance in real-world application scenarios.