Abstract: Recently, many large language models (LLMs) have been proposed, showing advanced proficiency in code generation. Meanwhile, many efforts have been dedicated to evaluating LLMs on code generation benchmarks such as HumanEval. Although helpful for comparing different LLMs, existing evaluation focuses on a simple code generation scenario (i.e., function-level or statement-level code generation), which mainly asks LLMs to generate one single code unit (e.g., a function or a statement) for a given natural language description. Such evaluation covers independent and often small-scale code units, leaving it unclear how LLMs perform in real-world software development scenarios. To fill this knowledge gap, we make the first attempt to evaluate LLMs in a more challenging code generation scenario, i.e., class-level code generation. Compared with existing code generation benchmarks, it better reflects real-world software development scenarios because it involves broader contextual dependencies and multiple, interdependent units of code. We first manually construct ClassEval, the first class-level code generation benchmark, consisting of 100 class-level Python code generation tasks and built with approximately 500 person-hours. Based on ClassEval, we then perform the first study of 11 state-of-the-art LLMs on class-level code generation. We find that all LLMs perform much worse on class-level code generation than on method-level code generation. While GPT models still dominate other LLMs on class-level code generation, the performance rankings of other models on method-level code generation no longer hold for class-level code generation. Besides, most models (except GPT models) perform better when generating the class method by method, and they have limited ability to generate dependent code.
Based on our findings, we call for software engineering (SE) researchers' expertise to build more LLM benchmarks based on practical and complicated software development scenarios.
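To make the contrast with function-level generation concrete, the following is a minimal sketch of the kind of interdependent code a class-level task asks for. The class `ShoppingCart` and its methods are hypothetical and are not taken from ClassEval; they only illustrate methods that share state through a field and call one another.

```python
class ShoppingCart:
    """Hypothetical example of a class-level generation target (not from ClassEval)."""

    def __init__(self):
        # Shared state: every method below depends on this field.
        self.items = {}  # item name -> (unit_price, quantity)

    def add_item(self, name, unit_price, quantity=1):
        # Field dependency: reads and updates self.items.
        _, current_quantity = self.items.get(name, (unit_price, 0))
        self.items[name] = (unit_price, current_quantity + quantity)

    def subtotal(self):
        # Field dependency: aggregates over self.items.
        return sum(price * qty for price, qty in self.items.values())

    def total(self, tax_rate=0.0):
        # Method dependency: correct only if subtotal() is correct.
        return round(self.subtotal() * (1 + tax_rate), 2)


cart = ShoppingCart()
cart.add_item("pen", 2.0, 3)
cart.add_item("pen", 2.0, 1)
print(cart.subtotal(), cart.total(tax_rate=0.25))
```

A function-level benchmark could test `subtotal` in isolation, whereas here `total` is only correct if the model also generates `subtotal` and the `items` field consistently.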
