RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation

Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, Weizhu Chen
2023-10-20
Abstract: The task of repository-level code completion is to continue writing unfinished code based on the broader context of the repository. However, automated code completion tools find it difficult to use the useful information scattered across different files. We propose RepoCoder, a simple, generic, and effective framework to address this challenge. It streamlines the repository-level code completion process by incorporating a similarity-based retriever and a pre-trained code language model in an iterative retrieval-generation pipeline. RepoCoder makes effective use of repository-level information for code completion and can generate code at various levels of granularity. Moreover, we propose a new benchmark, RepoEval, which consists of the latest high-quality real-world repositories and covers line, API invocation, and function body completion scenarios. Experimental results indicate that RepoCoder significantly improves on the In-File completion baseline by over 10% in all settings and consistently outperforms the vanilla retrieval-augmented code completion approach. Furthermore, we validate the effectiveness of RepoCoder through comprehensive analysis, providing valuable insights for future research. Our source code and benchmark are publicly available: https://github.com/microsoft/CodeT/tree/main/RepoCoder
Computation and Language, Artificial Intelligence, Programming Languages, Software Engineering
What problem does this paper attempt to address?
This paper addresses a limitation of automatic code completion tools. When developers complete unfinished code, they often need information from other files in the repository, but existing tools usually rely only on the current file and make little use of repository-level context, which limits their performance. The paper therefore proposes RepoCoder, a framework that improves repository-level code completion through iterative retrieval and generation. Specifically, the paper focuses on the following aspects:

1. **Repository-level code completion**: Traditional code completion tools rely mainly on context inside the current file and ignore relevant information in other files of the repository. This leads to poor performance on complex, cross-file completion tasks. RepoCoder achieves repository-level code completion by combining similarity-based retrieval with pre-trained code language models.
2. **Iterative retrieval and generation framework**: To further improve completion accuracy, the paper proposes an iterative retrieval and generation framework. In each iteration, the retrieval query is updated with the code generated in the previous iteration, which gradually improves the quality of the retrieved code snippets and, in turn, the accuracy of the final completion.
3. **New benchmark**: To evaluate repository-level code completion, the paper introduces a new benchmark, RepoEval. It contains the latest high-quality repositories, covers line-level, API invocation, and function body completion scenarios, and uses unit tests to evaluate the functional correctness of the completed code instead of relying solely on similarity metrics.
4. **Experimental verification**: Through extensive experiments on several language models, the paper shows that RepoCoder improves on the in-file completion baseline by over 10% in all settings, and that iterating retrieval and generation outperforms single-pass retrieval-augmented generation.

In summary, the main goal of the paper is to improve the accuracy and practicality of code completion by introducing repository-level context, thereby helping developers write code more efficiently.
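
The iterative retrieval and generation loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sliding-window snippet builder, the Jaccard similarity over word tokens, the prompt layout, and all function names (`build_windows`, `retrieve`, `iterative_complete`) are hypothetical choices made for this example, and `generate` stands in for any pre-trained code language model.

```python
import re
from typing import Callable, Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Return the set of word tokens in text (a deliberately crude tokenizer)."""
    return set(re.findall(r"\w+", text))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def build_windows(repo_files: Dict[str, str], window: int = 4, stride: int = 2) -> List[str]:
    """Cut every file into overlapping line windows used as retrieval candidates."""
    snippets = []
    for source in repo_files.values():
        lines = source.splitlines()
        for start in range(0, max(len(lines) - window + 1, 1), stride):
            snippet = "\n".join(lines[start:start + window])
            if snippet.strip():  # skip empty windows
                snippets.append(snippet)
    return snippets


def retrieve(query: str, snippets: List[str], top_k: int = 2) -> List[str]:
    """Return the top_k snippets most similar to the query."""
    return sorted(snippets, key=lambda s: jaccard(query, s), reverse=True)[:top_k]


def iterative_complete(
    unfinished: str,
    snippets: List[str],
    generate: Callable[[str], str],  # assumption: wraps some code language model
    iterations: int = 2,
    top_k: int = 2,
) -> str:
    """Alternate retrieval and generation, refining the query each round."""
    query = unfinished  # first round: query with the unfinished code only
    completion = ""
    for _ in range(iterations):
        retrieved = retrieve(query, snippets, top_k)
        context = "\n".join(f"# Retrieved snippet:\n{s}" for s in retrieved)
        completion = generate(context + "\n" + unfinished)
        # Later rounds: the previous completion is added to the retrieval query.
        query = unfinished + "\n" + completion
    return completion
```

With `iterations=1` the loop reduces to ordinary single-pass retrieval-augmented completion. The second round is where the generated code can steer retrieval toward snippets that the unfinished code alone would not match.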