Repoformer: Selective Retrieval for Repository-Level Code Completion

Di Wu, Wasi Uddin Ahmad, Dejiao Zhang, Murali Krishna Ramanathan, Xiaofei Ma

2024-06-04

Abstract: Recent advances in retrieval-augmented generation (RAG) have initiated a new era in repository-level code completion. However, the invariable use of retrieval in existing methods exposes issues in both efficiency and robustness, with a large proportion of the retrieved contexts proving unhelpful or harmful to code language models (code LMs). In this paper, we propose a selective RAG framework to avoid retrieval when unnecessary. To power this framework, we design a self-supervised learning approach to enable a code LM to accurately self-evaluate whether retrieval can improve its output quality and robustly leverage the potentially noisy retrieved contexts. Using this LM as both the selective RAG policy and the generation model, our framework achieves state-of-the-art repository-level code completion performance on diverse benchmarks including RepoEval, CrossCodeEval, and CrossCodeLongEval, a new long-form code completion benchmark. Meanwhile, our analyses show that selectively retrieving brings as much as 70% inference speedup in the online serving setting without harming the performance. We further demonstrate that our framework is able to accommodate different generation models, retrievers, and programming languages. These advancements position our framework as an important step towards more accurate and efficient repository-level code completion.

Software Engineering, Computation and Language

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address the issues of efficiency and robustness in **repository-level code completion**. Specifically:

1. **Problems with Existing Methods**:
   - Existing retrieval-augmented generation (RAG) methods always perform retrieval during code completion, leading to efficiency and robustness issues. In many cases, the retrieved contextual information not only fails to assist the code language models (code LMs) but also introduces irrelevant or even harmful information.
2. **Proposed New Framework**:
   - The authors propose a selective RAG framework (REPOFORMER) that decides when to perform retrieval through a self-assessment mechanism. This framework trains the code language model through self-supervised learning, enabling it to accurately judge whether retrieval can improve its output quality and robustly utilize retrieval results in the presence of potential noise.
3. **Experimental Results**:
   - In various benchmarks, REPOFORMER shows significant performance improvements compared to methods that always perform retrieval, particularly excelling in edit similarity over other models of the same scale. Additionally, selective retrieval brings up to a 70% increase in inference speed without compromising accuracy.
4. **Generality and Flexibility**:
   - REPOFORMER can be applied to language models of different scales and can adapt to different retrievers and programming languages, demonstrating good generality and flexibility.

Through this research, the authors hope to advance more efficient and accurate repository-level code completion technology.
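
The selective retrieval idea summarized above can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `SelectiveRAG`, the three callables, the 0.5 threshold, and the simple prompt concatenation are all hypothetical, chosen only to show the control flow of "self-assess first, retrieve only when it is expected to help".

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SelectiveRAG:
    """Hypothetical wrapper; none of these names come from the paper."""

    # Assumed: the code LM's own estimate, in [0, 1], that retrieval will help.
    score_retrieval_need: Callable[[str], float]
    # Assumed: returns repository snippets relevant to the current context.
    retrieve: Callable[[str], List[str]]
    # Assumed: the code LM's completion call, taking a prompt string.
    generate: Callable[[str], str]
    # Assumed decision threshold; a real system would tune this value.
    threshold: float = 0.5

    def complete(self, context: str) -> str:
        # Self-assessment: skip retrieval when it is unlikely to help.
        if self.score_retrieval_need(context) < self.threshold:
            return self.generate(context)
        # Otherwise retrieve, prepend the snippets, and generate.
        snippets = self.retrieve(context)
        prompt = "\n".join(snippets + [context])
        return self.generate(prompt)
```

With stub callables, a low self-assessment score leads straight to generation and the retriever is never called, which is where the latency saving in a selective setup would come from. A score at or above the threshold follows the usual retrieve-then-generate path.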

RepoGenReflex: Enhancing Repository-Level Code Completion with Verbal Reinforcement and Retrieval-Augmented Generation

RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation

RLCoder: Reinforcement Learning for Repository-Level Code Completion

A Lightweight Framework for Adaptive Retrieval In Code Completion With Critique Model

Prompt-based Code Completion via Multi-Retrieval Augmented Generation

RepoMinCoder: Improving Repository-Level Code Generation Based on Information Loss Screening

REPOFUSE: Repository-Level Code Completion with Fused Dual Context

GraphCoder: Enhancing Repository-Level Code Completion Via Coarse-to-fine Retrieval Based on Code Context Graph

ReACC: A Retrieval-Augmented Code Completion Framework

GraphCoder: Enhancing Repository-Level Code Completion via Code Context Graph-based Retrieval and Language Model

Dataflow-Guided Retrieval Augmentation for Repository-Level Code Completion

CodeRAG-Bench: Can Retrieval Augment Code Generation?

Preference-Guided Refactored Tuning for Retrieval Augmented Code Generation

Exploring Demonstration Retrievers in RAG for Coding Tasks: Yeas and Nays!

R2C2-Coder: Enhancing and Benchmarking Real-world Repository-level Code Completion Abilities of Code Large Language Models

LLM Agents Improve Semantic Code Search

Enhancing Repository-Level Code Generation with Integrated Contextual Information

Teaching Code LLMs to Use Autocompletion Tools in Repository-Level Code Generation

ContextModule: Improving Code Completion via Repository-level Contextual Information

RRGcode: Deep hierarchical search-based code generation