GraphCoder: Enhancing Repository-Level Code Completion via Code Context Graph-based Retrieval and Language Model

Wei Liu, Ailun Yu, Daoguang Zan, Bo Shen, Wei Zhang, Haiyan Zhao, Zhi Jin, Qianxiang Wang
2024-09-13
Abstract: The performance of repository-level code completion depends upon the effective leverage of both general and repository-specific knowledge. Despite the impressive capability of code LLMs in general code completion tasks, they often exhibit less satisfactory performance on repository-level completion due to the lack of repository-specific knowledge in these LLMs. To address this problem, we propose GraphCoder, a retrieval-augmented code completion framework that leverages LLMs' general code knowledge and the repository-specific knowledge via a graph-based retrieval-generation process. In particular, GraphCoder captures the context of the completion target more accurately through a code context graph (CCG) that consists of control-flow, data-dependence, and control-dependence edges between code statements, a more structured way to capture the completion target's context than the sequence-based context used in existing retrieval-augmented approaches. Based on the CCG, GraphCoder further employs a coarse-to-fine retrieval process to locate code snippets in the current repository whose context is similar to that of the completion target. Experimental results demonstrate both the effectiveness and efficiency of GraphCoder: compared to baseline retrieval-augmented methods, GraphCoder achieves higher exact match (EM) on average, with increases of +6.06 in code match and +6.23 in identifier match, while using less time and space.
Software Engineering
What problem does this paper attempt to address?
This paper addresses the poor performance of large language models (LLMs) on repository-level code completion, which stems from their lack of repository-specific knowledge. Although LLMs perform well on general code completion tasks, they cannot readily learn or access repository-specific knowledge, such as code style and API usage within the repository, so their repository-level completions are often unsatisfactory. To solve this problem, the paper proposes GraphCoder, a retrieval-augmented generation framework built on a code context graph (CCG). GraphCoder combines the general code knowledge of LLMs with repository-specific knowledge through a graph-based retrieval and generation process, aiming to improve both the effectiveness and the efficiency of repository-level code completion. Its main innovation is the CCG, which consists of control-flow, data-dependence, and control-dependence relations between code statements and therefore captures the context of the completion target in a more structured way than existing sequence-based methods. GraphCoder also adopts a coarse-to-fine retrieval process to find code snippets in the current repository whose context is similar to that of the completion target. Experimental results show that, compared with baseline retrieval-augmented methods, GraphCoder achieves higher exact match on average (+6.06 in code match and +6.23 in identifier match) while using less time and space.
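The coarse-to-fine retrieval idea can be illustrated with a minimal sketch. It is not GraphCoder's implementation: the abstract does not specify the similarity measures, so this example assumes a cheap token-set Jaccard score for the coarse stage and a character-level sequence ratio for the fine stage, and it operates on plain snippets rather than on CCG slices. The names `tokens`, `jaccard`, `coarse_to_fine`, `coarse_k`, and `fine_k` are hypothetical.

```python
import re
from difflib import SequenceMatcher


def tokens(snippet: str) -> set[str]:
    """Return the set of identifier-like tokens in a code snippet."""
    return set(re.findall(r"[A-Za-z_]\w*", snippet))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def coarse_to_fine(query: str, candidates: list[str],
                   coarse_k: int = 3, fine_k: int = 1) -> list[str]:
    """Two-stage retrieval (illustrative stand-in, not the paper's metrics).

    Coarse stage: keep the coarse_k candidates with the highest
    token-set Jaccard similarity to the query context.
    Fine stage: re-rank that shortlist with a costlier sequence
    similarity and return the best fine_k snippets.
    """
    query_tokens = tokens(query)
    shortlist = sorted(
        candidates,
        key=lambda c: jaccard(query_tokens, tokens(c)),
        reverse=True,
    )[:coarse_k]
    reranked = sorted(
        shortlist,
        key=lambda c: SequenceMatcher(None, query, c).ratio(),
        reverse=True,
    )
    return reranked[:fine_k]


# Hypothetical repository snippets and completion context.
repo_snippets = [
    "total = 0\nfor x in items:\n    total += x.price",
    "print('hello world')",
    "count = 0\nfor y in rows:\n    count += 1",
    "import os\npath = os.getcwd()",
]
context = "total = 0\nfor item in items:\n    total += item.price"
best = coarse_to_fine(context, repo_snippets)
```

The retrieved snippets would then be placed in the LLM prompt ahead of the unfinished code. The two-stage design keeps the expensive comparison confined to a small shortlist, which is one plausible reason a coarse-to-fine scheme can save time.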