Abstract: LLMs have demonstrated significant potential in code generation tasks, achieving promising results at the function or statement level across various benchmarks. However, the complexities of creating code artifacts such as classes, particularly within real-world software repositories, remain underexplored. Prior research treats class-level generation as an isolated task, neglecting the intricate dependencies and interactions that characterize real-world software environments. To address this gap, we introduce RepoClassBench, a comprehensive benchmark designed to rigorously evaluate LLMs in generating complex, class-level code within real-world repositories. RepoClassBench includes "Natural Language to Class generation" tasks across Java, Python, and C# from a selection of repositories. We ensure that each class in our dataset not only has cross-file dependencies within the repository but also includes corresponding test cases to verify its functionality. We find that current models struggle with the realistic challenges posed by our benchmark, primarily due to their limited exposure to relevant repository contexts. To address this shortcoming, we introduce Retrieve-Repotools-Reflect (RRR), a novel approach that equips LLMs with static analysis tools to iteratively navigate and reason about repository-level context in an agent-based framework. Our experiments demonstrate that RRR significantly outperforms existing baselines on RepoClassBench, showcasing its effectiveness across programming languages and under various settings. Our findings emphasize the critical need for code-generation benchmarks to incorporate repo-level dependencies to more accurately reflect the complexities of software development. Our work shows the benefits of leveraging specialized tools to enhance LLMs' understanding of repository context. We plan to make our dataset and evaluation harness public.
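The iterative, tool-assisted loop described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: every name here (`generate_with_feedback`, `get_signature`, `REPO_SYMBOLS`, and the toy stand-ins for the LLM and the test oracle) is hypothetical, and using test feedback to decide which tool to call is an assumption made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Attempt:
    code: str       # candidate class produced in this round
    feedback: str   # empty string means the oracle reported no failure
    passed: bool


def generate_with_feedback(
    description: str,
    generate: Callable[[str, List[str]], str],
    evaluate: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    choose_tool: Callable[[str], Tuple[str, str]],
    max_rounds: int = 3,
) -> List[Attempt]:
    """Generate, check, then query one repository tool per failed round."""
    context: List[str] = []   # repository facts gathered so far
    history: List[Attempt] = []
    for _ in range(max_rounds):
        code = generate(description, context)
        feedback = evaluate(code)
        passed = feedback == ""
        history.append(Attempt(code, feedback, passed))
        if passed:
            break
        # Reflect on the failure: pick a tool and an argument, then add
        # the tool's output to the context for the next attempt.
        tool_name, argument = choose_tool(feedback)
        context.append(tools[tool_name](argument))
    return history


# --- Toy stand-ins (all hypothetical) ---------------------------------
REPO_SYMBOLS = {"Shape": "class Shape:\n    def area(self) -> float: ..."}


def get_signature(name: str) -> str:
    # Stand-in for a static analysis lookup of a symbol in the repository.
    return REPO_SYMBOLS.get(name, f"<no symbol named {name}>")


def toy_generate(description: str, context: List[str]) -> str:
    # Stand-in for an LLM: guesses a wrong method name until the context
    # contains the base class signature.
    if any("def area" in fact for fact in context):
        return ("class Square(Shape):\n"
                "    def area(self) -> float:\n"
                "        return self.side ** 2")
    return ("class Square(Shape):\n"
            "    def get_area(self):\n"
            "        return self.side ** 2")


def toy_evaluate(code: str) -> str:
    # Stand-in for running the repository's test cases.
    if "def area" in code:
        return ""
    return "AttributeError: 'Square' object has no attribute 'area'"


def toy_choose_tool(feedback: str) -> Tuple[str, str]:
    # Stand-in for the reflection step: always inspects the base class.
    return ("get_signature", "Shape")


history = generate_with_feedback(
    "A square that extends Shape and reports its area.",
    toy_generate,
    toy_evaluate,
    {"get_signature": get_signature},
    toy_choose_tool,
)
```

In this toy run the first attempt fails because the generator does not know the base class interface; the second succeeds once the looked-up signature is in the context. A real system would replace the stand-ins with an actual model, a test harness, and language-specific static analysis.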

Repository-Level Prompt Generation for Large Language Models of Code

A Review of Repository Level Prompting for LLMs

RepoFusion: Training Code Models to Understand Your Repository

RLCoder: Reinforcement Learning for Repository-Level Code Completion

Enhancing Repository-Level Code Generation with Integrated Contextual Information

R2C2-Coder: Enhancing and Benchmarking Real-world Repository-level Code Completion Abilities of Code Large Language Models

AceCoder: An Effective Prompting Technique Specialized in Code Generation

Prompting and Fine-tuning Large Language Models for Automated Code Review Comment Generation

Code Prompting: a Neural Symbolic Method for Complex Reasoning in Large Language Models

Prompt-based Code Completion via Multi-Retrieval Augmented Generation

RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation

Class-Level Code Generation from Natural Language Using Iterative, Tool-Enhanced Reasoning over Repository

What Makes Large Language Models Reason in (Multi-Turn) Code Generation?

Prompt Space Optimizing Few-shot Reasoning Success with Large Language Models

Repoformer: Selective Retrieval for Repository-Level Code Completion

Active Prompting with Chain-of-Thought for Large Language Models

Set-Based Prompting: Provably Solving the Language Model Order Dependency Problem

Genetic Auto-prompt Learning for Pre-trained Code Intelligence Language Models

RePrompt: Planning by Automatic Prompt Engineering for Large Language Models Agents

Selective Prompt Anchoring for Code Generation

Prompt2Model: Generating Deployable Models from Natural Language Instructions