Abstract: How to evaluate Large Language Models (LLMs) in code generation is an open question. Existing benchmarks demonstrate poor alignment with real-world code repositories and are insufficient to evaluate the coding abilities of LLMs. This paper proposes a new benchmark, EvoCodeBench, to address these problems. It has three primary advances. (1) EvoCodeBench aligns with real-world repositories in multiple dimensions, e.g., code distributions and dependency distributions. (2) EvoCodeBench offers comprehensive annotations (e.g., requirements, reference code, and reference dependencies) and robust evaluation metrics (e.g., Pass@k and Recall@k). (3) EvoCodeBench is an evolving benchmark to avoid data leakage. We build an automatic pipeline to update EvoCodeBench from the latest repositories. We release the first version, EvoCodeBench-2403, containing 275 samples from 25 real-world repositories. Based on EvoCodeBench, we propose repository-level code generation and evaluate 10 popular LLMs (e.g., gpt-4, gpt-3.5, DeepSeek Coder, StarCoder 2, CodeLLaMa, Gemma, and Qwen 1.5). Our experiments reveal the coding abilities of these LLMs in real-world repositories. For example, the highest Pass@1 of gpt-4 is only 20.73% in our experiments. We also analyze failed cases and summarize the shortcomings of existing LLMs in EvoCodeBench. We release EvoCodeBench, all prompts, and LLMs' completions for further community analysis.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper attempts to address the inconsistency between existing code generation benchmarks and actual code repositories. Current benchmarks are inadequate in evaluating the code generation capabilities of large language models (LLMs) and fail to comprehensively reflect the performance of LLMs in real development processes. Specifically, the existing benchmarks have the following issues:

1. **Inconsistency with actual code repositories**: Current benchmarks do not accurately reflect the code distribution and dependency distribution in real code repositories.
2. **Lack of comprehensive annotations**: Existing benchmarks lack detailed annotations such as natural language requirements, original repositories, reference code, and reference dependencies.
3. **Inadequate evaluation metrics**: Current benchmarks lack robust evaluation metrics such as functional correctness and dependency recall rate.
4. **Risk of data leakage**: With the emergence of more LLMs, existing benchmarks pose a potential risk of data leakage.

To address these issues, the paper proposes a new code generation benchmark, EvoCodeBench. EvoCodeBench has the following main advantages:

1. **Alignment with actual code repositories**: EvoCodeBench collects samples from high-quality real open-source repositories, ensuring that its code distribution and dependency distribution are consistent with actual repositories.
2. **Provision of comprehensive annotations**: EvoCodeBench provides detailed annotations, including natural language requirements, original repositories, reference code, and reference dependencies.
3. **Robust evaluation metrics**: EvoCodeBench includes test cases to evaluate the functional correctness of model predictions and reports Pass@k. Additionally, Recall@k is proposed to evaluate dependencies in predictions.
4. **Avoidance of data leakage**: EvoCodeBench is a dynamically updated benchmark, regularly updated from the latest repositories to avoid data leakage.

Through these improvements, EvoCodeBench can more accurately evaluate the code generation capabilities of LLMs in actual code repositories. The paper also evaluates 10 popular LLMs based on EvoCodeBench, revealing their performance in real repositories, analyzing failure cases, and summarizing the shortcomings of existing LLMs.
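To make the two metrics named above more concrete, here is a minimal Python sketch. It is not taken from the EvoCodeBench code base. `pass_at_k` uses the widely used unbiased estimator 1 - C(n-c, k)/C(n, k) for a task with `n` sampled completions of which `c` pass the tests; whether EvoCodeBench computes Pass@k exactly this way is an assumption. `recall_at_k` assumes that Recall@k is the best dependency recall among `k` completions for a task, and the function names, the set-of-strings representation of dependencies, and the handling of an empty reference set are all illustrative choices.

```python
from math import comb
from typing import Sequence, Set


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate for one task.

    n: number of completions sampled, c: number that passed all tests.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one passing completion.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def dependency_recall(reference: Set[str], predicted: Set[str]) -> float:
    """Fraction of reference dependencies that appear in one completion."""
    if not reference:
        # Assumption: a task with no reference dependencies counts as fully recalled.
        return 1.0
    return len(reference & predicted) / len(reference)


def recall_at_k(reference: Set[str], predictions: Sequence[Set[str]], k: int) -> float:
    """Best dependency recall among the first k completions (assumed definition)."""
    if not 1 <= k <= len(predictions):
        raise ValueError("require 1 <= k <= len(predictions)")
    return max(dependency_recall(reference, p) for p in predictions[:k])
```

Benchmark-level scores would then be the mean of these per-task values over all samples. The dependency sets are assumed to have been extracted from the generated code beforehand; that parsing step is outside this sketch.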

EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories

EvoCodeBench: An Evolving Code Generation Benchmark with Domain-Specific Evaluations

DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories

Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM

DevEval: Evaluating Code Generation in Practical Software Projects

Towards more realistic evaluation of LLM-based code generation: an experimental study and beyond

CodeJudge: Evaluating Code Generation with Large Language Models

ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code

How Well Do LLMs Generate Code for Different Application Domains? Benchmark and Evaluation

ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code

Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review

CodeEditorBench: Evaluating Code Editing Capability of Large Language Models

Evaluating Large Language Models in Class-Level Code Generation

CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks

TestBench: Evaluating Class-Level Test Case Generation Capability of Large Language Models

TESTEVAL: Benchmarking Large Language Models for Test Case Generation

CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks

CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models

The First Prompt Counts the Most! An Evaluation of Large Language Models on Iterative Example-based Code Generation