Abstract: Large language models (LLMs) have demonstrated a strong ability to generate code for productive activities. However, current benchmarks for code synthesis, such as HumanEval, MBPP, and DS-1000, are predominantly oriented towards introductory tasks in algorithms and data science, and do not adequately cover the challenging requirements prevalent in real-world coding. To fill this gap, we propose NaturalCodeBench (NCB), a challenging code benchmark designed to mirror the complexity and variety of scenarios in real coding tasks. NCB comprises 402 high-quality problems in Python and Java, meticulously selected from natural user queries from online coding services, covering 6 different domains. Noting the extraordinary difficulty of creating test cases for real-world queries, we also introduce a semi-automated pipeline to enhance the efficiency of test case construction. Compared with manual solutions, it achieves an efficiency increase of more than 4 times. Our systematic experiments on 39 LLMs find that performance gaps on NCB between models with close HumanEval scores can still be significant, indicating a lack of focus on practical code synthesis scenarios or over-specified optimization on HumanEval. On the other hand, even the best-performing GPT-4 is still far from satisfactory on NCB. The evaluation toolkit and development set are available at

What problem does this paper attempt to address?

The main problem this paper addresses is that current code synthesis benchmarks (such as HumanEval and MBPP) focus mainly on fundamental tasks in algorithms and data science and fail to reflect the complexity and diversity of real-world coding tasks. To this end, the authors propose NaturalCodeBench (NCB), a challenging code benchmark designed to mirror the complexity and diversity of actual coding tasks. NCB contains 402 high-quality Python and Java problems, carefully selected from natural user queries in online coding services and covering 6 different domains.

Specifically, the paper addresses the following issues:

1. **Limitations of Benchmark Sets**: Existing code synthesis benchmarks (such as HumanEval and MBPP) focus mainly on basic algorithm and data science tasks and do not cover the complex engineering needs of real-world applications. As a result, they are inadequate for evaluating the practical code generation capabilities of large language models (LLMs).
2. **Complexity of Real-World Problems**: Real-world user needs are usually more complex and diverse, while the problems in existing benchmarks are relatively simple. A benchmark that reflects actual user needs is therefore required.
3. **Efficiency of Test Case Generation**: High-quality test cases are key to evaluating code generation performance, but creating them is time-consuming and labor-intensive. The paper proposes a semi-automated pipeline that makes test case construction more than 4 times as efficient as manual construction.
4. **Differences in Model Performance**: Through systematic experiments on 39 LLMs using NCB, the paper finds that even models with similar HumanEval scores can differ significantly on NCB. This suggests that some existing LLMs may be over-optimized for specific benchmarks and perform poorly in practical applications.

In summary, by proposing NaturalCodeBench, the paper fills a gap in how existing benchmarks evaluate practical code generation capabilities, and it provides tools and data for studying and improving LLM performance in real-world application scenarios.
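To make the role of test cases in point 3 concrete, here is a minimal sketch of how an execution-based benchmark could score generated code. It is not the NCB evaluation toolkit: the test-case format, the function names (`load_function`, `passes_all_tests`, `pass_rate`), and the toy problem are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical test-case format: (positional arguments, expected return value).
TestCase = Tuple[Tuple[Any, ...], Any]


def load_function(source: str, name: str) -> Callable[..., Any]:
    """Execute candidate source in a fresh namespace and return the named function."""
    namespace: Dict[str, Any] = {}
    # NOTE: exec is unsafe for untrusted code; a real harness would sandbox this step.
    exec(source, namespace)
    return namespace[name]


def passes_all_tests(source: str, name: str, tests: List[TestCase]) -> bool:
    """Return True only if the candidate returns the expected value for every test case."""
    try:
        func = load_function(source, name)
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        # A syntax error, a missing function, or a runtime error counts as a failure.
        return False


def pass_rate(results: List[bool]) -> float:
    """Fraction of candidates (or problems) that passed all of their tests."""
    return sum(results) / len(results) if results else 0.0


# Toy problem (invented for this sketch): count comma-separated fields in a line.
tests: List[TestCase] = [(("a,b,c",), 3), (("",), 0)]
good = "def count_fields(line):\n    return len(line.split(',')) if line else 0\n"
bad = "def count_fields(line):\n    return len(line.split(','))\n"  # mishandles the empty line

results = [
    passes_all_tests(good, "count_fields", tests),
    passes_all_tests(bad, "count_fields", tests),
]
print(results, pass_rate(results))
```

The second candidate fails only on the empty-string case, which illustrates why the quality of the test cases, and not just their number, determines how well a benchmark separates correct from nearly correct solutions.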

NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Evaluating and Aligning CodeLLMs on Human Preference

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code

CodeEditorBench: Evaluating Code Editing Capability of Large Language Models

ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code

ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks

CoderUJB: An Executable and Unified Java Benchmark for Practical Programming Scenarios

CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks

InfiCoder-Eval: Systematically Evaluating the Question-Answering Capabilities of Code Large Language Models

InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models

MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation

Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM

Benchmarking Language Model Creativity: A Case Study on Code Generation

HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization

EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories

EvoCodeBench: An Evolving Code Generation Benchmark with Domain-Specific Evaluations

HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks

xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval