Abstract: Code generation benchmarks such as HumanEval are widely adopted to evaluate LLMs' capabilities. However, after consolidating the latest 24 benchmarks, we noticed three significant imbalances. First, imbalanced programming languages: 95.8% of benchmarks involve Python, while only 5 benchmarks involve Java. Second, imbalanced code granularity: function-/statement-level benchmarks account for over 83.3% of benchmarks. Only a handful extend to the class/project level, and all of those are limited to Python. Third, a lack of advanced features: existing benchmarks primarily assess basic coding skills while overlooking advanced Object-Oriented Programming (OOP) features (i.e., encapsulation, inheritance, and polymorphism). To fill these gaps, we propose JavaBench, a project-level Java benchmark that exercises OOP features. It comprises four Java projects with 389 methods in 106 Java classes. The test coverage is up to 92%, and JavaBench is attested by 282 undergraduate students, reaching a 90.93/100 average score (i.e., pass rate against the test suite), ensuring the quality of documentation, code skeleton, and tests. To better evaluate LLMs' capabilities against JavaBench, we introduce a systematic evaluation design covering three context settings and five synthesis strategies at two granularities using three hierarchical metrics. Our extensive experiment yields several interesting findings. First, regarding project-level Java programming, LLMs are far behind undergraduate students (no project can be correctly completed by any studied LLM, and at most 41.17% Pass@5 is reached in a more relaxed evaluation). Second, using method signatures as prompt context may strike an ideal balance for project-level code generation. JavaBench is publicly available at https://github.com/java-bench/JavaBench.

What problem does this paper attempt to address?

The paper attempts to address three significant imbalances in current code generation benchmarks:

1. **Imbalance in Programming Languages**: Among existing benchmarks, 95.8% involve Python, while only 5 benchmarks involve Java. This leaves the ability of large language models (LLMs) to generate Java code insufficiently understood.
2. **Imbalance in Code Granularity**: Most benchmarks focus on function-level or statement-level code generation, accounting for over 83.3% of all benchmarks. Only a few extend to the class level or project level, and these are limited to Python. This imbalance restricts the evaluation of LLMs' ability to handle more complex code structures.
3. **Lack of Advanced Features**: Existing benchmarks mainly assess basic coding skills (such as variables, data types, operators, and control structures), while ignoring advanced features of object-oriented programming (OOP), namely encapsulation, inheritance, and polymorphism. These features are very common in real Java project development, so benchmarks are needed that test how LLMs handle them.

To fill these gaps, the authors propose JavaBench, a project-level Java benchmark designed to evaluate LLMs' ability to handle OOP features (i.e., encapsulation, inheritance, and polymorphism). JavaBench includes 4 Java projects with a total of 389 methods distributed across 106 Java classes. Its test coverage reaches up to 92%, and it was validated by 282 undergraduates with an average score of 90.93/100, ensuring the quality of documentation, code skeletons, and tests.

Through a systematic evaluation design, the authors conducted extensive experiments on five LLMs under three context settings, five synthesis strategies, and two evaluation granularities, using three levels of evaluation metrics.
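To make the three OOP features named above concrete, here is a minimal Java sketch. It is not taken from JavaBench's projects: the classes `Piece`, `Rook`, and `Bishop` and all of their methods are hypothetical names chosen for illustration. The comments mark where each feature appears and which method bodies a skeleton-completion task of this kind might plausibly leave for a model to fill in.

```java
import java.util.List;

// Hypothetical illustration only; none of these classes come from JavaBench.
class OopSketch {
    public static void main(String[] args) {
        // Polymorphism: both objects are handled through the Piece type.
        List<Piece> pieces = List.of(new Rook(0, 0), new Bishop(0, 2));
        for (Piece p : pieces) {
            System.out.println(p.getName() + " moved: " + p.moveTo(3, 0));
        }
    }
}

abstract class Piece {
    // Encapsulation: state is private and only reachable through methods.
    private final String name;
    private int row;
    private int col;

    protected Piece(String name, int row, int col) {
        this.name = name;
        this.row = row;
        this.col = col;
    }

    String getName() { return name; }
    int getRow() { return row; }
    int getCol() { return col; }

    // In a skeleton-style task, only this signature (plus documentation)
    // would be given; each subclass supplies the body.
    abstract boolean canMoveTo(int targetRow, int targetCol);

    // Updates the position only if the subclass-specific rule allows it.
    final boolean moveTo(int targetRow, int targetCol) {
        if (!canMoveTo(targetRow, targetCol)) {
            return false;
        }
        row = targetRow;
        col = targetCol;
        return true;
    }
}

// Inheritance: Rook reuses Piece's state handling and overrides one rule.
final class Rook extends Piece {
    Rook(int row, int col) { super("Rook", row, col); }

    @Override
    boolean canMoveTo(int targetRow, int targetCol) {
        // Exactly one coordinate must change (straight-line move).
        return (targetRow == getRow()) != (targetCol == getCol());
    }
}

final class Bishop extends Piece {
    Bishop(int row, int col) { super("Bishop", row, col); }

    @Override
    boolean canMoveTo(int targetRow, int targetCol) {
        int dRow = Math.abs(targetRow - getRow());
        int dCol = Math.abs(targetCol - getCol());
        return dRow == dCol && dRow > 0; // diagonal move of at least one step
    }
}
```

A correct completion of `canMoveTo` in a subclass depends on the accessors and contract defined in `Piece`, which is the kind of cross-class dependency that function-level benchmarks do not exercise.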
The experimental results show that LLMs' project-level Java programming ability is far inferior to that of undergraduates: the best LLM achieves only 41.17% Pass@5 (at test granularity) in the most favorable setting, while undergraduates achieved an average of 90.93% under the stricter evaluation. Additionally, the study found that providing method signatures as prompt context may strike an ideal balance for project-level code generation.
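For readers unfamiliar with the Pass@5 figure quoted above, the following is a minimal sketch of how a Pass@k score is commonly computed. It assumes the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of generated samples per task and c is the number that pass. Whether JavaBench computes Pass@k in exactly this way is an assumption here, and the class and method names (`PassAtK`, `estimate`, `mean`) are hypothetical.

```java
// Sketch of the common unbiased pass@k estimator; the names are illustrative
// and the use of this exact estimator by JavaBench is an assumption.
final class PassAtK {
    private PassAtK() {}

    /**
     * Estimates pass@k for one task from n samples, c of which passed:
     * 1 - C(n - c, k) / C(n, k), computed as a running product for stability.
     */
    static double estimate(int n, int c, int k) {
        if (k < 1 || k > n || c < 0 || c > n) {
            throw new IllegalArgumentException("require 1 <= k <= n and 0 <= c <= n");
        }
        if (n - c < k) {
            return 1.0; // fewer than k failures: any k samples include a pass
        }
        double allFail = 1.0;
        // C(n - c, k) / C(n, k) equals the product of (1 - k / i) for i in (n - c, n].
        for (int i = n - c + 1; i <= n; i++) {
            allFail *= 1.0 - (double) k / i;
        }
        return 1.0 - allFail;
    }

    /** Averages the per-task estimates; passCounts[i] is c for task i. */
    static double mean(int[] passCounts, int n, int k) {
        if (passCounts.length == 0) {
            throw new IllegalArgumentException("need at least one task");
        }
        double sum = 0.0;
        for (int c : passCounts) {
            sum += estimate(n, c, k);
        }
        return sum / passCounts.length;
    }

    public static void main(String[] args) {
        // Two hypothetical tasks with 5 samples each: one never passes, one always does.
        System.out.println(mean(new int[] {0, 5}, 5, 5));
    }
}
```

When k equals n, as it would if exactly five samples were drawn for Pass@5, the estimate for a task reduces to 1 if at least one sample passes and 0 otherwise.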

JavaBench: A Benchmark of Object-Oriented Code Generation for Evaluating Large Language Models

CoderUJB: An Executable and Unified Java Benchmark for Practical Programming Scenarios

OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models

CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

EvoCodeBench: An Evolving Code Generation Benchmark with Domain-Specific Evaluations

TestBench: Evaluating Class-Level Test Case Generation Capability of Large Language Models

EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories

TESTEVAL: Benchmarking Large Language Models for Test Case Generation

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code

Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review

A Metrics-Based Comparative Study on Object-Oriented Programming Languages

ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code

ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation

DOMAINEVAL: An Auto-Constructed Benchmark for Multi-Domain Code Generation

Evaluating Large Language Models in Class-Level Code Generation

CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation

Evaluating and Aligning CodeLLMs on Human Preference

CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery

Escalating LLM-based Code Translation Benchmarking into the Class-level Era