Abstract: Recently, many large language models (LLMs) have been proposed, showing advanced proficiency in code generation. Meanwhile, many efforts have been dedicated to evaluating LLMs on code generation benchmarks such as HumanEval. Although very helpful for comparing different LLMs, existing evaluation focuses on a simple code generation scenario (i.e., function-level or statement-level code generation), which mainly asks LLMs to generate a single code unit (e.g., a function or a statement) for a given natural language description. Such evaluation focuses on generating independent and often small-scale code units, leaving it unclear how LLMs perform in real-world software development scenarios. To fill this knowledge gap, we make the first attempt to evaluate LLMs in a more challenging code generation scenario, i.e., class-level code generation. Compared with existing code generation benchmarks, it better reflects real-world software development scenarios because it comprises broader contextual dependencies and multiple, interdependent units of code. We first manually construct the first class-level code generation benchmark, ClassEval, consisting of 100 class-level Python code generation tasks built with approximately 500 person-hours of effort. Based on the new benchmark ClassEval, we then perform the first study of 11 state-of-the-art LLMs on class-level code generation. Based on our results, we find that all LLMs perform much worse on class-level code generation than on method-level code generation. While GPT models still dominate other LLMs on class-level code generation, the performance rankings of other models on method-level code generation no longer hold for class-level code generation. Besides, most models (except GPT models) perform better when generating the class method by method, and they have limited ability to generate dependent code.
Based on our findings, we call for software engineering (SE) researchers' expertise to build more LLM benchmarks based on practical and complicated software development scenarios.

Multi-lingual Evaluation of Code Generation Models

MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation

CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X

MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation

mHumanEval -- A Multilingual Benchmark to Evaluate Large Language Models for Code Generation

HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization

xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval

McEval: Massively Multilingual Code Evaluation

CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution

L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models

The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models

MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation

Multi-Programming Language Ensemble for Code Generation in Large Language Model

MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems

Evaluating Large Language Models in Class-Level Code Generation

Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review

CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks

CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation

GEMv2: Multilingual NLG Benchmarking in a Single Line of Code

CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks