Abstract: The task of repository-level code completion is to continue writing the unfinished code based on a broader context of the repository. However, it is difficult for automated code completion tools to utilize the useful information scattered across different files. We propose RepoCoder, a simple, generic, and effective framework to address the challenge. It streamlines the repository-level code completion process by incorporating a similarity-based retriever and a pre-trained code language model in an iterative retrieval-generation pipeline. RepoCoder makes effective use of repository-level information for code completion and can generate code at various levels of granularity. Moreover, we propose a new benchmark, RepoEval, which consists of the latest high-quality real-world repositories and covers line, API invocation, and function body completion scenarios. Experimental results indicate that RepoCoder significantly improves the In-File completion baseline by over 10% in all settings and consistently outperforms the vanilla retrieval-augmented code completion approach. Furthermore, we validate the effectiveness of RepoCoder through comprehensive analysis, providing valuable insights for future research. Our source code and benchmark are publicly available: https://github.com/microsoft/CodeT/tree/main/RepoCoder

What problem does this paper attempt to address?

The paper addresses a gap in automatic code completion. When completing unfinished code, developers routinely draw on information from other files in the repository, but existing completion tools usually rely only on the current file and struggle to use repository-level context, which limits their performance. The paper therefore proposes a new framework, RepoCoder, which improves repository-level code completion through iterative retrieval and generation so that repository-level information is put to better use. Specifically, the paper focuses on the following aspects:

1. **Repository-level code completion**: Traditional code-completion tools rely mainly on the context inside the current file and ignore relevant information in other files of the repository. This leads to poor performance on complex, cross-file completion tasks. RepoCoder achieves repository-level code completion by combining similarity-based retrieval with pre-trained code language models.
2. **Iterative retrieval and generation framework**: To further improve completion accuracy, the paper proposes an iterative retrieval and generation framework. In each iteration, the retrieval query is adjusted using the code generated in the previous iteration, which gradually improves the quality of the retrieved code snippets and finally yields more accurate completions.
3. **New benchmark test set**: To evaluate repository-level code completion, the paper introduces a new benchmark, RepoEval. It contains the latest high-quality repositories, covers line-level, API call, and function body completion scenarios, and uses unit tests to evaluate the functional correctness of the completed code instead of relying solely on similarity metrics.
4. **Experimental verification**: Extensive experiments across several language models show that RepoCoder performs significantly better than traditional in-file completion, and that performance continues to improve over multiple iterations, exceeding single-pass retrieval-augmented generation.

In summary, the main goal of the paper is to improve the accuracy and practicality of code completion by introducing repository-level context information, thereby helping developers write code more efficiently.
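The iterative retrieval and generation loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: Jaccard similarity over token sets stands in for the similarity-based retriever, the window sizes are arbitrary, and `generate` is a hypothetical callable standing in for the pre-trained code language model. All function names here are invented for the example.

```python
import re
from typing import Callable, Dict, List, Set


def tokenize(code: str) -> Set[str]:
    """Split code into a set of identifier-like tokens."""
    return set(re.findall(r"[A-Za-z_]\w*", code))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two token sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def build_windows(files: Dict[str, str], window: int = 4, stride: int = 2) -> List[str]:
    """Cut every repository file into overlapping windows of lines."""
    windows = []
    for text in files.values():
        lines = text.splitlines()
        if not lines:
            continue  # skip empty files
        for start in range(0, max(len(lines) - window + 1, 1), stride):
            windows.append("\n".join(lines[start:start + window]))
    return windows


def retrieve(query: str, windows: List[str], top_k: int = 2) -> List[str]:
    """Return the top_k windows most similar to the query."""
    q = tokenize(query)
    ranked = sorted(windows, key=lambda w: jaccard(q, tokenize(w)), reverse=True)
    return ranked[:top_k]


def iterative_completion(
    unfinished: str,
    windows: List[str],
    generate: Callable[[str], str],  # hypothetical stand-in for the code LM
    iterations: int = 2,
) -> str:
    """Alternate retrieval and generation; each round's output refines the next query."""
    query = unfinished
    completion = ""
    for _ in range(iterations):
        snippets = retrieve(query, windows)
        prompt = "\n".join("# Retrieved snippet:\n" + s for s in snippets)
        prompt = prompt + "\n" + unfinished
        completion = generate(prompt)
        # Assumption: the next query is the unfinished code plus the draft completion.
        query = unfinished + completion
    return completion


if __name__ == "__main__":
    repo = {"utils.py": "def add(a, b):\n    return a + b\n"}
    # Toy generator that ignores the prompt; a real system would call a code LM.
    print(iterative_completion("total =", build_windows(repo), lambda p: " add(x, y)"))
```

The point of the second and later rounds is that the draft completion often mentions identifiers (for example, an API name) that the unfinished code alone does not contain, so the refined query can surface snippets the first retrieval missed.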

RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation

RLCoder: Reinforcement Learning for Repository-Level Code Completion

Repoformer: Selective Retrieval for Repository-Level Code Completion

R2C2-Coder: Enhancing and Benchmarking Real-world Repository-level Code Completion Abilities of Code Large Language Models

RepoGenReflex: Enhancing Repository-Level Code Completion with Verbal Reinforcement and Retrieval-Augmented Generation

RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

GraphCoder: Enhancing Repository-Level Code Completion Via Coarse-to-fine Retrieval Based on Code Context Graph

GraphCoder: Enhancing Repository-Level Code Completion via Code Context Graph-based Retrieval and Language Model

RepoMinCoder: Improving Repository-Level Code Generation Based on Information Loss Screening

REPOFUSE: Repository-Level Code Completion with Fused Dual Context

RepoMasterEval: Evaluating Code Completion via Real-World Repositories

ExecRepoBench: Multi-level Executable Code Completion Evaluation

RepoHyper: Search-Expand-Refine on Semantic Graphs for Repository-Level Code Completion

RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph

Dataflow-Guided Retrieval Augmentation for Repository-Level Code Completion

RepoFusion: Training Code Models to Understand Your Repository

Prompt-based Code Completion via Multi-Retrieval Augmented Generation

ReACC: A Retrieval-Augmented Code Completion Framework

Enhancing Repository-Level Code Generation with Integrated Contextual Information

A Lightweight Framework for Adaptive Retrieval In Code Completion With Critique Model

RAMBO: Enhancing RAG-based Repository-Level Method Body Completion