Abstract: With the growing reliance on automated code completion tools in software development, the need for robust evaluation benchmarks has become critical. However, existing benchmarks focus mainly on code generation tasks at the function and class level and provide rich text descriptions to prompt the model. By contrast, such descriptive prompts are commonly unavailable in real development, and code completion can occur in a wider range of situations, such as in the middle of a function or a code block. These limitations make the evaluation align poorly with the practical scenarios of code completion tools. In this paper, we propose RepoMasterEval, a novel benchmark for evaluating code completion models, constructed from real-world Python and TypeScript repositories. Each benchmark datum is generated by masking a code snippet (the ground truth) from one source code file that has an existing test suite. To improve the test accuracy of model-generated code, we employ mutation testing to measure the effectiveness of the test cases and manually craft new test cases for test suites with low mutation scores. Our empirical evaluation on 6 state-of-the-art models shows that test augmentation is critical in improving the accuracy of the benchmark and that RepoMasterEval can report differences in model performance in real-world scenarios. The deployment of RepoMasterEval in a collaborating company for one month also revealed that the benchmark is useful for giving accurate feedback during model training and that its score correlates highly with the model's performance in practice. Based on our findings, we call for the software engineering community to build more LLM benchmarks tailored for code generation tools that take practical and complex development environments into account.
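The abstract does not say how mutation scores are computed, so the following is only a minimal sketch of the general idea, assuming a toy function `in_range`, a single kind of mutation (swapping `<` and `<=`), and hypothetical test lists. None of these names come from the paper. A mutant is "killed" when at least one test gives a wrong result on it, and the mutation score is the fraction of mutants killed.

```python
import ast

# Hypothetical function under test; not taken from the paper.
SOURCE = """
def in_range(x, lo, hi):
    return lo <= x and x < hi
"""

# Illustrative mutation operator: swap strict and non-strict "less than".
SWAPS = {ast.LtE: ast.Lt, ast.Lt: ast.LtE}


def _sites(tree):
    """Collect comparison nodes whose first operator can be swapped."""
    return [
        node for node in ast.walk(tree)
        if isinstance(node, ast.Compare) and type(node.ops[0]) in SWAPS
    ]


def make_mutants(source):
    """Return one compiled mutant per swappable comparison in the source."""
    mutants = []
    for index in range(len(_sites(ast.parse(source)))):
        tree = ast.parse(source)  # fresh tree so each mutant has one change
        node = _sites(tree)[index]
        node.ops[0] = SWAPS[type(node.ops[0])]()
        mutants.append(compile(tree, "<mutant>", "exec"))
    return mutants


def mutation_score(source, func_name, tests):
    """Fraction of mutants for which at least one test result is wrong."""
    mutants = make_mutants(source)
    killed = 0
    for code in mutants:
        namespace = {}
        exec(code, namespace)
        func = namespace[func_name]
        if any(func(*args) != expected for args, expected in tests):
            killed += 1
    return killed / len(mutants)


# A weak suite that never probes the boundaries, and an augmented one that does.
WEAK_TESTS = [((5, 0, 10), True), ((20, 0, 10), False)]
AUGMENTED_TESTS = WEAK_TESTS + [((0, 0, 10), True), ((10, 0, 10), False)]

print(mutation_score(SOURCE, "in_range", WEAK_TESTS))
print(mutation_score(SOURCE, "in_range", AUGMENTED_TESTS))
```

In this toy setting the weak suite kills neither mutant, while the two added boundary cases kill both, which is the kind of gap that hand-written test additions are meant to close.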

What problem does this paper attempt to address?

This paper aims to address the limitations of existing evaluation benchmarks for code completion tools so that evaluation better reflects real-world code-completion scenarios. Specifically, the paper points out that current evaluation benchmarks have the following four main problems:

1. **Simple scenarios**: Existing benchmarks mainly focus on relatively simple code-generation tasks, such as statement-level, function-level or class-level generation. These tasks usually involve generating a single code unit (for example, a statement, function or class) in isolation. However, in actual software development, code-generation tasks can occur in the middle of a code block and may not have subsequent code, which makes the existing benchmarks unable to accurately reflect actual code-completion requirements.
2. **Lack of repository-level context information**: Although some benchmarks are constructed from code repositories, they do not fully utilize the rich context information in the repositories to improve the accuracy of model predictions, even though actual code-completion tools already use this information.
3. **Limited test-suite quality**: Existing benchmarks mainly rely on predefined test cases to evaluate the correctness of the model-generated code, but these test cases are often insufficient and may overlook some edge cases.
4. **Lack of research on the correlation between model performance and practical applications**: At present, there is no research exploring the correlation between benchmark performance and usability in the actual production environment, which makes it difficult to determine the actual effectiveness of a benchmark.

To solve these problems, the paper introduces a new benchmark, **RepoMasterEval**, which improves the evaluation of code-completion models in the following ways:

- **Construct the benchmark from real-world Python and TypeScript repositories** to ensure that the evaluation is closer to actual development scenarios.
- **Adopt mutation testing and manual test-case writing** to ensure the quality and coverage of test cases.
- **Provide a comprehensive task structure**, including prefixes, suffixes, retrieved information and test cases, to simulate the code-completion scenario in an IDE.
- **Conduct an industrial-practice evaluation for the first time** to verify the effectiveness and relevance of the benchmark in practical applications.

Through these improvements, RepoMasterEval can more accurately evaluate the performance of code-completion models in the real world and provide valuable feedback for optimizing these models.
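The task structure described above (prefix, suffix, retrieved information and test cases) can be illustrated with a small sketch. This is not the benchmark's actual data format or harness: the `CompletionTask` class, `mask_lines`, `passes_tests`, the sample `mean` function and its tests are all hypothetical names chosen for illustration, and a real harness would run the repository's test suite in an isolated environment rather than calling `exec`.

```python
from dataclasses import dataclass, field


@dataclass
class CompletionTask:
    """One hypothetical benchmark datum: code around a masked span."""
    prefix: str
    ground_truth: str
    suffix: str
    retrieved_context: list = field(default_factory=list)


def mask_lines(source, start, end, retrieved_context=()):
    """Mask lines [start, end) of a source file to form a completion task."""
    lines = source.splitlines(keepends=True)
    return CompletionTask(
        prefix="".join(lines[:start]),
        ground_truth="".join(lines[start:end]),
        suffix="".join(lines[end:]),
        retrieved_context=list(retrieved_context),
    )


def passes_tests(task, completion, test_code):
    """Splice a completion into the masked span and run the tests on it."""
    candidate = task.prefix + completion + task.suffix
    namespace = {}
    try:
        exec(candidate, namespace)   # load the reconstructed file
        exec(test_code, namespace)   # assertions raise on failure
    except Exception:
        return False
    return True


# Toy source file and test suite (assumptions, not from the paper).
SOURCE = '''def mean(values):
    total = 0
    for v in values:
        total += v
    return total / len(values)
'''
TESTS = "assert mean([1, 2, 3]) == 2\nassert mean([4]) == 4\n"

task = mask_lines(SOURCE, 2, 4)  # mask the loop in the middle of the function
print(passes_tests(task, task.ground_truth, TESTS))
print(passes_tests(task, "    total = sum(values)\n", TESTS))
print(passes_tests(task, "    pass\n", TESTS))
```

Judging a completion by whether the tests pass, rather than by textual match with the ground truth, lets a functionally equivalent completion such as the `sum` variant count as correct.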

RepoMasterEval: Evaluating Code Completion via Real-World Repositories
