Enhancing Repository-Level Code Generation with Integrated Contextual Information

Zhiyuan Pan, Xing Hu, Xin Xia, Xiaohu Yang

2024-06-05

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, repository-level code generation presents unique challenges, particularly due to the need to utilize information spread across multiple files within a repository. Existing retrieval-based approaches sometimes fall short as they are limited in obtaining a broader and deeper repository context. In this paper, we present CatCoder, a novel code generation framework designed for statically typed programming languages. CatCoder enhances repository-level code generation by integrating relevant code and type context. Specifically, it leverages static analyzers to extract type dependencies and merges this information with retrieved code to create comprehensive prompts for LLMs. To evaluate the effectiveness of CatCoder, we adapt and construct benchmarks that include 199 Java tasks and 90 Rust tasks. The results show that CatCoder outperforms the RepoCoder baseline by up to 17.35% in terms of pass@k score. Furthermore, the generalizability of CatCoder is assessed using various LLMs, including both code-specialized models and general-purpose models. Our findings indicate consistent performance improvements across all models, which underlines the practicality of CatCoder.
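The abstract reports results as pass@k scores but does not say how they are computed. The sketch below assumes the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are generated per task and c of them pass the tests. The function names `pass_at_k` and `mean_pass_at_k` are illustrative and are not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task: n samples generated, c of them pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average pass@k over tasks; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

A benchmark-level score is then the mean over all tasks, for example `mean_pass_at_k([(10, 5), (10, 0)], k=1)`.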

Software Engineering, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses how to make better use of cross-file contextual information to improve repository-level code generation. Existing retrieval-based methods are limited in how broad and deep a repository context they can obtain, especially for statically typed programming languages. Retrieved code snippets alone may not cover all the logic and API calls needed to generate correct code, and without cross-file context the model may generate incorrect field accesses or API calls (i.e., "hallucinations"). To address these issues, the paper proposes CatCoder, a framework that enhances repository-level code generation by incorporating relevant code and type context.

The main contributions of CatCoder include:

1. **Proposing a new framework**: A framework for statically typed programming languages that generates repository-level code by integrating relevant code and type context.
2. **Constructing a Rust benchmark dataset**: This is the first Rust benchmark dataset for evaluating repository-level code generation performance.
3. **Evaluating the effectiveness of CatCoder**: Experimental results on the Java and Rust benchmark datasets show that CatCoder outperforms the RepoCoder baseline by up to 17.35% in pass@k score.

Overall, the paper aims to improve the performance of large language models in repository-level code generation tasks by introducing type context and an improved code retrieval mechanism.
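To make the idea of merging type context with retrieved code more concrete, here is a minimal sketch of how such a prompt might be assembled. It is an illustration under stated assumptions, not CatCoder's actual prompt template: the `TypeContext` class, the `build_prompt` function, the section labels, and the ordering of sections are all hypothetical, and the Java fragments are made-up inputs.

```python
from dataclasses import dataclass, field


@dataclass
class TypeContext:
    """Hypothetical record of one type that the target function depends on."""
    name: str
    members: list = field(default_factory=list)  # field and method signatures


def build_prompt(target_signature, type_contexts, retrieved_snippets):
    """Merge type context and retrieved code into a single prompt string."""
    parts = []
    if type_contexts:
        # Assumed layout: type information first, written as comments.
        parts.append("// Type context (from static analysis):")
        for t in type_contexts:
            parts.append(f"// {t.name}")
            parts.extend(f"//   {m}" for m in t.members)
    if retrieved_snippets:
        parts.append("// Relevant code (from retrieval):")
        for snippet in retrieved_snippets:
            parts.extend(f"// {line}" for line in snippet.splitlines())
    # The signature to complete goes last so the model continues from it.
    parts.append(target_signature)
    return "\n".join(parts)


prompt = build_prompt(
    "public int totalPrice(Cart cart) {",
    [TypeContext("Cart", ["List<Item> items", "int size()"])],
    ["int sum = 0;\nfor (Item i : cart.items) sum += i.price;"],
)
print(prompt)
```

Placing the type members next to the retrieved code gives the model the field and method names it may legally use, which is the kind of information the summary says is missing when a model hallucinates field accesses or API calls.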

On the Impacts of Contexts on Repository-Level Code Generation

R2C2-Coder: Enhancing and Benchmarking Real-world Repository-level Code Completion Abilities of Code Large Language Models

Teaching Code LLMs to Use Autocompletion Tools in Repository-Level Code Generation

Class-Level Code Generation from Natural Language Using Iterative, Tool-Enhanced Reasoning over Repository

Iterative Refinement of Project-Level Code Context for Precise Code Generation with Compiler Feedback

RLCoder: Reinforcement Learning for Repository-Level Code Completion

ContextModule: Improving Code Completion via Repository-level Contextual Information

Enhancing LLM-Based Coding Tools through Native Integration of IDE-Derived Static Context

CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges

GraphCoder: Enhancing Repository-Level Code Completion via Code Context Graph-based Retrieval and Language Model

Repository-Level Prompt Generation for Large Language Models of Code

EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories

RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation

Repoformer: Selective Retrieval for Repository-Level Code Completion

CodeRAG-Bench: Can Retrieval Augment Code Generation?

StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback

Improving Natural Language Capability of Code Large Language Model

A^3-CodGen: A Repository-Level Code Generation Framework for Code Reuse with Local-Aware, Global-Aware, and Third-Party-Library-Aware

RepoMinCoder: Improving Repository-Level Code Generation Based on Information Loss Screening