Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion?

Zhenyu Pan, Rongyu Cao, Yongchang Cao, Yingwei Ma, Binhua Li, Fei Huang, Han Liu, Yongbin Li

2024-10-24

Abstract: Code completion, a key downstream task in code generation, is one of the most frequent and impactful methods for enhancing developer productivity in software development. As intelligent completion tools evolve, we need a robust evaluation benchmark that enables meaningful comparisons between products and guides future advancements. However, existing benchmarks focus on coarse-grained tasks without industrial analysis, and so resemble general code generation rather than the real-world scenarios developers encounter. Moreover, these benchmarks often rely on costly and time-consuming human annotation, and their standalone test cases fail to leverage minimal tests for maximum repository-level understanding and code coverage. To address these limitations, we first analyze business data from an industrial code completion tool and redefine the evaluation criteria to better align with the developer's intent and desired completion behavior throughout the coding process. Based on these insights, we introduce Codev-Agent, an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage, ensuring fair and effective comparisons. Using Codev-Agent, we present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework. Codev-Bench assesses whether a code completion tool can capture a developer's immediate intent and suggest appropriate code across diverse contexts, providing a more realistic benchmark for code completion in modern software development.

Software Engineering, Artificial Intelligence

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to address the shortcomings of existing code completion tool evaluation benchmarks. Specifically, the current evaluation benchmarks have the following three main issues:

1. **Coarse-grained Tasks**: Existing evaluation benchmarks mainly focus on coarse-grained tasks, such as class or function generation, which do not align with the real scenarios developers encounter in actual development.
2. **Manually Annotated Data**: Current evaluation benchmarks rely on expensive and time-consuming manually annotated data samples and test cases, making continuous updates difficult and inflexible.
3. **Isolated Generation of Test Cases**: The test cases generated by existing evaluation benchmarks cannot fully utilize minimal tests to achieve maximum repository-level understanding and code coverage.

To address these issues, the authors propose a new evaluation framework, **Codev-Bench**, and an automated system, **Codev-Agent**. By analyzing feedback data from industrial-grade code completion tools, they redefine the evaluation criteria to better align with developers' intentions and expected completion behaviors. Codev-Agent automates processes such as repository crawling, execution environment construction, dynamic call chain extraction, and test case generation, ensuring fair and effective comparisons. Codev-Bench is a fine-grained, real-world, repository-level, and developer-centric evaluation framework that can more accurately assess the performance of code completion tools in modern software development.
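To make "dynamic call chain extraction" from an existing unit test more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the summary above does not describe how Codev-Agent does this, so the sketch assumes a simple approach based on the standard library's `sys.setprofile` hook. The helper `trace_call_edges` and the toy functions `normalize`, `tokenize`, and `test_tokenize` are hypothetical names introduced only for illustration.

```python
import sys
from collections import defaultdict


def trace_call_edges(func, *args, **kwargs):
    """Run func and count (caller, callee) pairs for Python-level calls.

    Illustrative assumption: a profile hook is one simple way to observe
    which functions a unit test actually reaches at run time.
    """
    edges = defaultdict(int)

    def profiler(frame, event, arg):
        # "call" fires on entry to a Python function; calls into C code
        # are reported as "c_call" and are ignored here.
        if event == "call" and frame.f_back is not None:
            caller = frame.f_back.f_code.co_name
            callee = frame.f_code.co_name
            edges[(caller, callee)] += 1

    sys.setprofile(profiler)
    try:
        result = func(*args, **kwargs)
    finally:
        # Always remove the hook, even if the test raises.
        sys.setprofile(None)
    return result, dict(edges)


# Hypothetical repository code and unit test, used only for illustration.
def normalize(text):
    return text.strip().lower()


def tokenize(text):
    return normalize(text).split()


def test_tokenize():
    assert tokenize("  Hello World ") == ["hello", "world"]


if __name__ == "__main__":
    _, call_edges = trace_call_edges(test_tokenize)
    for (caller, callee), count in sorted(call_edges.items()):
        print(f"{caller} -> {callee} (x{count})")
```

Under these assumptions, the recorded edges show which repository functions a single test exercises, which is the kind of information a benchmark builder could use to pick completion targets that are already covered by an executable test.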

