Abstract: In this work, we make the first attempt to evaluate LLMs in a more challenging code generation scenario, i.e., class-level code generation. We first manually construct the first class-level code generation benchmark, ClassEval, which consists of 100 class-level Python code generation tasks and took approximately 500 person-hours to build. Based on it, we then perform the first study of 11 state-of-the-art LLMs on class-level code generation. Our results lead to the following main findings. First, all existing LLMs perform much worse on class-level code generation than on standalone method-level code generation benchmarks like HumanEval, and method-level coding ability does not equivalently reflect class-level coding ability among LLMs. Second, GPT-4 and GPT-3.5 still show dominant superiority over other LLMs on class-level code generation, and the second-tier models include Instruct-Starcoder, Instruct-Codegen, and Wizardcoder, which perform very similarly. Third, generating the entire class all at once (i.e., the holistic generation strategy) is the best generation strategy only for GPT-4 and GPT-3.5, while method-by-method generation (i.e., incremental and compositional) is the better strategy for the other models, which have limited ability to understand long instructions and utilize the middle information. Lastly, we find that models have limited ability to generate method-dependent code, and we discuss the frequent error types in generated classes. Our benchmark is available at https://github.com/FudanSELab/ClassEval.
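The three generation strategies named in the abstract can be sketched as prompt-building loops. This is a minimal illustration and not the authors' implementation: the `generate` callable is a hypothetical stand-in for an LLM call, the prompt wording is invented, and the distinction drawn between incremental generation (each prompt sees previously generated methods) and compositional generation (each method is generated independently, then assembled) is an assumption about how the two method-by-method strategies differ.

```python
from typing import Callable, List

# Hypothetical stand-in for an LLM call: takes a prompt, returns generated code.
Generate = Callable[[str], str]


def holistic(skeleton: str, generate: Generate) -> str:
    """Ask the model for the entire class in a single request."""
    return generate(f"Complete the whole class:\n{skeleton}")


def incremental(header: str, method_specs: List[str], generate: Generate) -> str:
    """Generate methods one at a time; each prompt includes the methods generated so far."""
    class_so_far = header
    for spec in method_specs:
        body = generate(f"{class_so_far}\n# Next method to implement:\n{spec}")
        class_so_far += "\n" + body
    return class_so_far


def compositional(header: str, method_specs: List[str], generate: Generate) -> str:
    """Generate each method independently of the others, then assemble the class."""
    bodies = [generate(f"{header}\n# Method to implement:\n{spec}") for spec in method_specs]
    return "\n".join([header, *bodies])
```

Under these assumptions, the holistic strategy needs one long prompt, while the two method-by-method strategies trade that for several shorter prompts, which matches the abstract's remark about models that struggle with long instructions.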

What problem does this paper attempt to address?

This paper addresses the lack of evaluation for complex code generation tasks, in particular class-level code generation. Existing code generation benchmarks mainly focus on simple scenarios such as function-level or statement-level code generation. They evaluate the generation of short, independent code units and ignore the ability to generate composite code units made up of multiple interdependent methods. The authors therefore constructed a new benchmark, ClassEval, specifically for evaluating the performance of large language models (LLMs) on class-level code generation tasks.

### Main Issues

1. **Limitations of Existing Benchmarks**:
   - Existing benchmarks mainly focus on generating short and independent code units (such as functions or statements), so they cannot comprehensively evaluate the ability of LLMs to generate long code fragments and composite code units.
   - Existing benchmarks assume that the generated code is independent, ignoring the dependencies between methods in actual development.
2. **Need for Class-Level Code Generation**:
   - Class-level code generation tasks are more complex, involving multiple interdependent methods.
   - A dedicated benchmark is needed to evaluate the performance of LLMs in handling such complex tasks.

### Solution

To fill this knowledge gap, the authors constructed the first class-level code generation benchmark, ClassEval, and conducted the first study to evaluate the performance of 11 state-of-the-art LLMs on class-level code generation tasks. ClassEval contains 100 class-level Python code generation tasks, each accompanied by a high-coverage test suite for checking the correctness of the generated code.

### Main Contributions

1. **Constructed the first class-level code generation benchmark, ClassEval**, manually building 100 class-level Python code generation tasks covering a wide range of practical software development topics.
2. **Conducted the first study** to evaluate the performance of 11 representative LLMs on class-level code generation tasks, using three different generation strategies (holistic generation, incremental generation, and compositional generation).
3. **Discovered performance differences of existing LLMs in class-level code generation tasks** and discussed common types of generation errors.

Through this work, the authors hope to promote research on the capabilities of LLMs in complex code generation tasks and to provide a reference for future improvements.
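To make "multiple interdependent methods" concrete, here is a small sketch of what a class-level task and its test might look like. The `ShoppingCart` class and its test are hypothetical and are not taken from ClassEval; they only illustrate why a method such as `total_with_discount` cannot be judged in isolation, since it is correct only if `add_item` and `total` are correct too.

```python
import unittest


class ShoppingCart:
    """Hypothetical class-level task (not from ClassEval)."""

    def __init__(self):
        # Shared state used by every method: name -> (unit_price, quantity).
        self.items = {}

    def add_item(self, name, unit_price, quantity=1):
        """Add items, accumulating the quantity if the name already exists."""
        _, current = self.items.get(name, (unit_price, 0))
        self.items[name] = (unit_price, current + quantity)

    def total(self):
        """Return the sum of unit_price * quantity over all items."""
        return sum(price * qty for price, qty in self.items.values())

    def total_with_discount(self, rate):
        """Method dependency: this result relies on total() being correct."""
        return self.total() * (1 - rate)


class ShoppingCartTest(unittest.TestCase):
    def test_total_with_discount(self):
        cart = ShoppingCart()
        cart.add_item("pen", 2.0, 3)
        cart.add_item("pen", 2.0, 2)
        self.assertEqual(cart.total(), 10.0)
        self.assertAlmostEqual(cart.total_with_discount(0.5), 5.0)
```

Saved to a file, the test can be run with `python -m unittest <file>`. A function-level benchmark would typically check each of these methods separately, whereas a class-level test exercises them together through the shared state.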

ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation

Evaluating Large Language Models in Class-Level Code Generation

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

Escalating LLM-based Code Translation Benchmarking into the Class-level Era

MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation

Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM

DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories

CodeJudge: Evaluating Code Generation with Large Language Models

How Well Do LLMs Generate Code for Different Application Domains? Benchmark and Evaluation

DevEval: Evaluating Code Generation in Practical Software Projects

Benchmarking Llama 3 70B for Code Generation: A Comprehensive Evaluation

Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review

How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark

TestBench: Evaluating Class-Level Test Case Generation Capability of Large Language Models

TESTEVAL: Benchmarking Large Language Models for Test Case Generation

HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation

Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation

EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories

Fixing Code Generation Errors for Large Language Models

On Evaluating the Efficiency of Source Code Generated by LLMs