Abstract: LLMs have demonstrated significant potential in code generation tasks, achieving promising results at the function or statement level across various benchmarks. However, the complexities of creating larger code artifacts such as classes, particularly within real-world software repositories, remain underexplored. Prior research treats class-level generation as an isolated task, neglecting the intricate dependencies and interactions that characterize real-world software environments. To address this gap, we introduce RepoClassBench, a comprehensive benchmark designed to rigorously evaluate LLMs on generating complex, class-level code within real-world repositories. RepoClassBench includes "Natural Language to Class generation" tasks across Java, Python, and C# from a selection of repositories. We ensure that each class in our dataset not only has cross-file dependencies within the repository but also includes corresponding test cases to verify its functionality. We find that current models struggle with the realistic challenges posed by our benchmark, primarily due to their limited exposure to relevant repository contexts. To address this shortcoming, we introduce Retrieve-Repotools-Reflect (RRR), a novel approach that equips LLMs with static analysis tools to iteratively navigate and reason about repository-level context in an agent-based framework. Our experiments demonstrate that RRR significantly outperforms existing baselines on RepoClassBench, showcasing its effectiveness across programming languages and under various settings. Our findings emphasize the critical need for code-generation benchmarks to incorporate repo-level dependencies so that they more accurately reflect the complexities of software development. Our work also shows the benefits of leveraging specialized tools to enhance LLMs' understanding of repository context. We plan to make our dataset and evaluation harness public.
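
The iterative, tool-assisted loop described in the abstract can be illustrated with a toy sketch. This is not the RRR implementation: the repository index, the `list_members` tool, the `find_unknown_symbols` check, and the `stub_model` stand-in for an LLM call are all hypothetical names invented here, and the feedback signal is simplified to "references a symbol that does not exist in the repository".

```python
import re

# Hypothetical repository index (symbol -> signature). A real system would
# derive this with static analysis; here it is hard-coded for illustration.
REPO_INDEX = {
    "Logger.warn": "def warn(self, msg: str) -> None",
    "Config.load": "def load(path: str) -> Config",
}


def list_members(class_name):
    """Toy repository tool: list the known members of a class."""
    prefix = class_name + "."
    return sorted(s[len(prefix):] for s in REPO_INDEX if s.startswith(prefix))


def find_unknown_symbols(code):
    """Toy feedback oracle: Class.member references missing from the index."""
    refs = re.findall(r"\b([A-Z]\w*\.\w+)\b", code)
    return sorted({r for r in refs if r not in REPO_INDEX})


def stub_model(description, hints):
    """Stand-in for an LLM call (assumption): it guesses `Logger.warning`
    unless a hint lists the members Logger actually has."""
    members = hints.get("Logger") or ["warning"]
    return (
        "class Service:\n"
        "    def run(self, log):\n"
        f"        Logger.{members[0]}(log, 'started')\n"
    )


def generate_class(description, max_rounds=3):
    """Generate, check, consult the tool on failure, and retry."""
    hints = {}
    code = ""
    for round_no in range(1, max_rounds + 1):
        code = stub_model(description, hints)
        unknown = find_unknown_symbols(code)
        if not unknown:
            return code, round_no
        # Reflect on the feedback: query the tool for each offending class.
        for symbol in unknown:
            cls = symbol.split(".")[0]
            hints[cls] = list_members(cls)
    return code, max_rounds
```

Calling `generate_class("A service that logs on start")` first yields code that references a nonexistent member, then consults the toy tool and retries with the repository's actual member name. In a real setting the oracle would more plausibly be a compiler or test run, and the model would choose which tool to invoke.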

Class-Level Code Generation from Natural Language Using Iterative, Tool-Enhanced Reasoning over Repository