Repoformer: Selective Retrieval for Repository-Level Code Completion

Di Wu, Wasi Uddin Ahmad, Dejiao Zhang, Murali Krishna Ramanathan, Xiaofei Ma
2024-06-04
Abstract: Recent advances in retrieval-augmented generation (RAG) have initiated a new era in repository-level code completion. However, the invariable use of retrieval in existing methods exposes issues in both efficiency and robustness, with a large proportion of the retrieved contexts proving unhelpful or harmful to code language models (code LMs). In this paper, we propose a selective RAG framework to avoid retrieval when unnecessary. To power this framework, we design a self-supervised learning approach to enable a code LM to accurately self-evaluate whether retrieval can improve its output quality and robustly leverage the potentially noisy retrieved contexts. Using this LM as both the selective RAG policy and the generation model, our framework achieves state-of-the-art repository-level code completion performance on diverse benchmarks including RepoEval, CrossCodeEval, and CrossCodeLongEval, a new long-form code completion benchmark. Meanwhile, our analyses show that selectively retrieving brings as much as 70% inference speedup in the online serving setting without harming the performance. We further demonstrate that our framework is able to accommodate different generation models, retrievers, and programming languages. These advancements position our framework as an important step towards more accurate and efficient repository-level code completion.
Software Engineering, Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses the efficiency and robustness of **repository-level code completion**. Specifically:

1. **Problems with Existing Methods**:
   - Existing retrieval-augmented generation (RAG) methods perform retrieval for every completion, which hurts both efficiency and robustness. A large proportion of the retrieved contexts do not help the code language model (code LM), and some are actively harmful.
2. **Proposed New Framework**:
   - The authors propose Repoformer, a selective RAG framework that retrieves only when the model's self-assessment indicates retrieval is likely to help. The code LM is trained with self-supervised learning to judge accurately whether retrieval can improve its output quality and to use potentially noisy retrieved contexts robustly. The same LM serves as both the selective RAG policy and the generation model.
3. **Experimental Results**:
   - On RepoEval, CrossCodeEval, and CrossCodeLongEval (a new long-form code completion benchmark), Repoformer achieves state-of-the-art repository-level code completion performance and improves on methods that always retrieve, particularly in edit similarity against models of the same scale. Selective retrieval also yields up to a 70% inference speedup in the online serving setting without harming performance.
4. **Generality and Flexibility**:
   - The framework accommodates different generation models, retrievers, and programming languages.

Through this research, the authors aim to advance more accurate and efficient repository-level code completion.
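The control flow of a selective RAG policy, as summarized above, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `SelectiveRAG` class, the `score_retrieval_need`, `retrieve`, and `generate` callables, and the `threshold` value of 0.5 are all hypothetical names and choices introduced here, and the toy stubs stand in for a real code LM and retriever.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SelectiveRAG:
    """Hypothetical wrapper: retrieve only when the model expects it to help."""

    # Assumed interface: returns the model's self-assessed probability
    # (0.0 to 1.0) that retrieved context would improve the completion.
    score_retrieval_need: Callable[[str], float]
    # Assumed interface: returns repository chunks relevant to the prompt.
    retrieve: Callable[[str], List[str]]
    # Assumed interface: completes the prompt given (possibly empty) contexts.
    generate: Callable[[str, List[str]], str]
    # Illustrative cut-off; a real system would tune this on held-out data.
    threshold: float = 0.5

    def complete(self, prompt: str) -> Tuple[str, bool]:
        """Return the completion and whether retrieval was performed."""
        if self.score_retrieval_need(prompt) >= self.threshold:
            contexts = self.retrieve(prompt)
            return self.generate(prompt, contexts), True
        # Skip retrieval entirely: this is where the latency saving comes from.
        return self.generate(prompt, []), False


# Toy stand-ins so the sketch runs without a model or an index.
def toy_score(prompt: str) -> float:
    # Pretend that calls into a repository-local module need retrieval.
    return 0.9 if "repo_util." in prompt else 0.1


def toy_retrieve(prompt: str) -> List[str]:
    return ["def helper(x): return x + 1"]


def toy_generate(prompt: str, contexts: List[str]) -> str:
    return f"<completion using {len(contexts)} retrieved chunk(s)>"


rag = SelectiveRAG(toy_score, toy_retrieve, toy_generate)
completion, used_retrieval = rag.complete("value = repo_util.")
```

In this sketch the scoring, retrieval, and generation steps are separate callables only to keep the example readable; the summary above describes a single LM acting as both the retrieval policy and the generator.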