JavaBench: A Benchmark of Object-Oriented Code Generation for Evaluating Large Language Models

Jialun Cao, Zhiyong Chen, Jiarong Wu, Shing-chi Cheung, Chang Xu
2024-10-11
Abstract: Code generation benchmarks such as HumanEval are widely adopted to evaluate LLMs' capabilities. However, after consolidating the latest 24 benchmarks, we noticed three significant imbalances. First, imbalanced programming language. 95.8% of benchmarks involve Python, while only 5 benchmarks involve Java. Second, imbalanced code granularity. Function-/statement-level benchmarks account for over 83.3% of benchmarks. Only a mere handful extend to class-/project-levels, and all are limited to Python. Third, lacking advanced features. Existing benchmarks primarily assess basic coding skills, while overlooking advanced Object-Oriented Programming (OOP) features (i.e., encapsulation, inheritance, and polymorphism). To fill these gaps, we propose JavaBench, a project-level Java benchmark that exercises OOP features. It comprises four Java projects with 389 methods in 106 Java classes. The test coverage is up to 92%, and JavaBench is attested by 282 undergraduate students, reaching a 90.93/100 average score (i.e., pass rate against the test suite), ensuring the quality of documentation, code skeleton, and tests. To better evaluate LLMs' capability against JavaBench, we introduce a systematic evaluation design covering three context settings and five synthesis strategies at two granularities using three hierarchical metrics. Our extensive experiment yields several interesting findings. First, we noticed that regarding project-level Java programming, LLMs are far behind undergraduate students (no project can be correctly completed by any studied LLMs, and at most 41.17% Pass@5 in a more relaxed evaluation). Second, using method signature as prompt context may strike an ideal balance for project-level code generation. JavaBench is publicly available at https://github.com/java-bench/JavaBench.
Machine Learning, Artificial Intelligence, Programming Languages, Software Engineering
What problem does this paper attempt to address?
The paper attempts to address three significant imbalances in current code generation benchmarks:

1. **Imbalance in Programming Languages**: Of the 24 benchmarks surveyed, 95.8% involve Python, while only 5 involve Java. This leaves the ability of large language models (LLMs) to generate Java code poorly understood.
2. **Imbalance in Code Granularity**: Function-level and statement-level benchmarks account for over 83.3% of all benchmarks. Only a few extend to the class or project level, and these are limited to Python. This imbalance restricts the evaluation of LLMs' ability to handle more complex code structures.
3. **Lack of Advanced Features**: Existing benchmarks mainly assess basic coding skills (such as variables, data types, operators, and control structures), while ignoring advanced object-oriented programming (OOP) features (encapsulation, inheritance, and polymorphism). These features are very common in real Java projects, so a benchmark is needed that tests how LLMs handle them.

To fill these gaps, the authors propose JavaBench, a project-level Java benchmark designed to evaluate LLMs' ability to handle OOP features. JavaBench includes 4 Java projects with a total of 389 methods across 106 Java classes. Its test coverage is up to 92%, and it was validated by 282 undergraduates, who reached an average score of 90.93/100 (pass rate against the test suite), which supports the quality of the documentation, code skeletons, and tests. Using a systematic evaluation design, the authors conducted extensive experiments on five LLMs under three context settings and five synthesis strategies, at two evaluation granularities, using three hierarchical metrics.
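As an illustration of the three OOP features named above, the following is a minimal sketch in Java. It is not taken from JavaBench; the classes `Shape`, `Circle`, `Rectangle`, and `OopDemo` are hypothetical names chosen only for this example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical example classes; none of these come from JavaBench.
abstract class Shape {
    // Encapsulation: the field is private and exposed only through a getter.
    private final String name;

    protected Shape(String name) {
        this.name = name;
    }

    public String getName() {
        return name;
    }

    // Each subclass must supply its own area computation.
    public abstract double area();
}

// Inheritance: Circle reuses the state and accessor defined in Shape.
final class Circle extends Shape {
    private final double radius;

    Circle(double radius) {
        super("circle");
        this.radius = radius;
    }

    @Override
    public double area() {
        return Math.PI * radius * radius;
    }
}

final class Rectangle extends Shape {
    private final double width;
    private final double height;

    Rectangle(double width, double height) {
        super("rectangle");
        this.width = width;
        this.height = height;
    }

    @Override
    public double area() {
        return width * height;
    }
}

final class OopDemo {
    // Polymorphism: the call s.area() dispatches to the subclass implementation.
    static double totalArea(List<Shape> shapes) {
        double sum = 0.0;
        for (Shape s : shapes) {
            sum += s.area();
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Shape> shapes = Arrays.asList(new Rectangle(2.0, 3.0), new Circle(1.0));
        System.out.println(totalArea(shapes));
    }
}
```

A method-level task in this style would hand the model a class skeleton and ask it to fill in one body such as `area()`, which requires respecting the private fields and the inherited contract rather than writing a standalone function.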
The experimental results show that LLMs' project-level Java programming ability is far inferior to that of undergraduates: the best LLM achieves at most 41.17% Pass@5 (at test granularity) under the more relaxed evaluation, while undergraduates achieved 90.93% under the stricter evaluation. Additionally, the study found that providing method signatures as prompt context may strike an ideal balance for project-level code generation.
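For readers unfamiliar with Pass@k, the sketch below computes it for a single task, assuming the widely used unbiased estimator introduced alongside HumanEval, 1 - C(n-c, k) / C(n, k), where n samples are generated and c of them pass. The source does not state which estimator JavaBench uses, so this is an assumption; the class name `PassAtK` is hypothetical.

```java
// Hypothetical helper; assumes the unbiased pass@k estimator from the HumanEval paper.
final class PassAtK {
    private PassAtK() {
    }

    /**
     * Estimates the probability that at least one of k samples passes,
     * given that c out of n generated samples passed.
     */
    static double passAtK(int n, int c, int k) {
        if (n <= 0 || c < 0 || c > n || k <= 0 || k > n) {
            throw new IllegalArgumentException("require 0 <= c <= n and 1 <= k <= n");
        }
        // Fewer than k failing samples: any draw of k must contain a passing one.
        if (n - c < k) {
            return 1.0;
        }
        // C(n-c, k) / C(n, k) rewritten as a product to avoid large factorials.
        double failAll = 1.0;
        for (int i = n - c + 1; i <= n; i++) {
            failAll *= 1.0 - (double) k / i;
        }
        return 1.0 - failAll;
    }

    public static void main(String[] args) {
        // 10 samples, 3 passing, budget of 5 draws.
        System.out.println(passAtK(10, 3, 5));
    }
}
```

A benchmark-level score is then typically the mean of this per-task value over all tasks. When n equals k (for example, five samples for Pass@5), the estimate reduces to 1 if any sample passes and 0 otherwise.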