ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation

Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, Yiling Lou
2023-08-14
Abstract: In this work, we make the first attempt to evaluate LLMs in a more challenging code generation scenario, i.e., class-level code generation. We first manually construct the first class-level code generation benchmark, ClassEval, consisting of 100 class-level Python code generation tasks built with approximately 500 person-hours of effort. Based on it, we then perform the first study of 11 state-of-the-art LLMs on class-level code generation. Our main findings are as follows. First, all existing LLMs perform much worse on class-level code generation than on standalone method-level code generation benchmarks such as HumanEval, and method-level coding ability does not equivalently reflect class-level coding ability among LLMs. Second, GPT-4 and GPT-3.5 still show dominant superiority over other LLMs on class-level code generation, and the second-tier models include Instruct-Starcoder, Instruct-Codegen, and Wizardcoder, which perform very similarly. Third, generating the entire class all at once (i.e., the holistic generation strategy) is the best generation strategy only for GPT-4 and GPT-3.5, while method-by-method generation (i.e., the incremental and compositional strategies) works better for the other models, which have limited ability to understand long instructions and to make use of information in the middle of the input. Lastly, we find that models have limited ability to generate method-dependent code, and we discuss the frequent error types in generated classes. Our benchmark is available at https://github.com/FudanSELab/ClassEval.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the lack of evaluation for complex code generation tasks, in particular class-level code generation. Existing code generation benchmarks mainly target simple scenarios such as function-level or statement-level generation: they measure how well a model produces short, independent code units and ignore its ability to produce composite units made up of multiple interdependent methods. The authors therefore constructed a new benchmark, ClassEval, specifically for evaluating the performance of large language models (LLMs) on class-level code generation tasks.

### Main Issues

1. **Limitations of Existing Benchmarks**:
   - Existing benchmarks mainly focus on generating short and independent code units (such as functions or statements), so they cannot comprehensively evaluate the ability of LLMs to generate long code fragments and composite code units.
   - Existing benchmarks assume that the generated code is independent, ignoring the dependencies between methods that arise in actual development.
2. **Need for Class-Level Code Generation**:
   - Class-level code generation tasks are more complex, involving multiple interdependent methods.
   - A dedicated benchmark is needed to evaluate the performance of LLMs in handling such complex tasks.

### Solution

To fill this gap, the authors constructed the first class-level code generation benchmark, ClassEval, and conducted the first study evaluating 11 state-of-the-art LLMs on class-level code generation tasks. ClassEval contains 100 class-level Python code generation tasks, each paired with a high-coverage test suite for checking the correctness of the generated code.

### Main Contributions

1. **Constructed the first class-level code generation benchmark, ClassEval**, manually building 100 class-level Python code generation tasks covering a wide range of practical software development topics.
2. **Conducted the first study** to evaluate the performance of 11 representative LLMs in class-level code generation tasks, using three different generation strategies (holistic generation, incremental generation, and compositional generation).
3. **Discovered performance differences of existing LLMs in class-level code generation tasks** and discussed common types of generation errors.

Through this work, the authors hope to promote research on the capabilities of LLMs in complex code generation tasks and to provide a reference for future improvements.
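The three generation strategies named above differ mainly in how many requests are sent to the model and what each request contains. The following is a minimal sketch of that difference, not the authors' implementation. It assumes that holistic generation sends the class skeleton once, that incremental generation asks for one method at a time while feeding back the methods already produced, and that compositional generation asks for each method independently. The `BankAccount` skeleton, the prompt wording, and the `RecordingModel` stand-in for an LLM are all hypothetical and are not taken from ClassEval.

```python
from typing import Callable, List

# Hypothetical LLM interface (an assumption): prompt text in, generated code out.
Generate = Callable[[str], str]

# Hypothetical class skeleton, not a ClassEval task. `withdraw` depends on the
# `balance` field that `deposit` also updates, i.e. the methods are interdependent.
SKELETON = '''class BankAccount:
    def __init__(self): self.balance = 0
    def deposit(self, amount): ...
    def withdraw(self, amount): ...
'''
METHODS = ["deposit", "withdraw"]


def holistic(skeleton: str, methods: List[str], generate: Generate) -> str:
    # One request: the model sees the whole skeleton and writes every method.
    return generate(f"Implement the whole class.\n{skeleton}")


def incremental(skeleton: str, methods: List[str], generate: Generate) -> str:
    # One request per method; each prompt carries the methods generated so far.
    done: List[str] = []
    for name in methods:
        written = "\n".join(done)
        prompt = f"Implement only `{name}`.\n{skeleton}\nAlready written:\n{written}"
        done.append(generate(prompt))
    return "\n".join(done)


def compositional(skeleton: str, methods: List[str], generate: Generate) -> str:
    # One request per method; no prompt sees another method's generated code.
    parts = [generate(f"Implement only `{name}`.\n{skeleton}") for name in methods]
    return "\n".join(parts)


class RecordingModel:
    """Fake model (an assumption): logs each prompt and returns a stub string."""

    def __init__(self) -> None:
        self.prompts: List[str] = []

    def __call__(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return f"# stub for call {len(self.prompts)}"
```

Passing a `RecordingModel` instance as `generate` makes the contrast visible: `holistic` logs a single prompt, while the other two log one prompt per method, and only the `incremental` prompts contain earlier outputs. In this sketch the long single prompt of the holistic strategy is the one that demands the most long-context understanding, which matches the abstract's explanation for why weaker models do better method by method.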