Abstract: Advancing automated programming necessitates robust and comprehensive code generation benchmarks, yet current evaluation frameworks, e.g., HumanEval and MBPP, largely neglect object-oriented programming (OOP) in favor of functional programming (FP). To address this, our study introduces a pioneering OOP-focused benchmark, featuring 431 Python programs that encompass essential OOP concepts and features like classes and encapsulation methods. We propose a novel evaluation metric, pass@o, tailored for OOP, enhancing traditional pass@k measures. Our evaluation of 23 leading large language models (LLMs), including both general and code-specialized models, reveals three key insights: 1) pass@o offers a more relevant and comprehensive assessment for OOP code generation; 2) despite excelling in FP, code-specialized LLMs like WizardCoder lag in OOP compared to models like ChatGPT; 3) the poor performance of all advanced LLMs on our OOP benchmark highlights a critical need for improvements in this field. Our benchmark and scripts are publicly released at:

What problem does this paper attempt to address?

The paper addresses two related problems: current large language models (LLMs) perform poorly on object-oriented programming (OOP) tasks, and existing code-generation benchmarks focus on functional programming (FP) while neglecting OOP concepts and features. Specifically:

1. **Lack of OOP evaluation benchmarks**: Existing code-generation benchmarks such as HumanEval and MBPP focus mainly on functional programming and do not fully cover the key concepts and features of object-oriented programming, such as classes, inheritance, and encapsulation methods. As a result, even models that perform well on these benchmarks may perform poorly on actual OOP tasks.
2. **Limitations of evaluation metrics**: Existing metrics such as pass@k mainly measure whether the generated code executes correctly. They cannot show whether the model actually produced the OOP concepts and features the task asked for, such as the required class names and private function names.

To solve these problems, the paper proposes the following:

1. **Construct an OOP evaluation benchmark**: The paper builds a benchmark of 431 Python programs covering key OOP concepts and features, such as classes, inheritance, and encapsulation methods.
2. **Propose a new evaluation metric, pass@o**: To evaluate OOP code generation more comprehensively, the paper proposes pass@o. The metric builds on the traditional pass@k and assesses a model's OOP ability by matching the key points in the natural-language description against the key points in the generated program (such as class names and private function names).
3. **Broadly evaluate existing LLMs**: The paper evaluates 23 mainstream LLMs, including general-purpose and code-specialized models, reveals the performance deficiencies of current LLMs on OOP tasks, and points out directions for improvement.

Through these measures, the paper aims to advance research on object-oriented programming in automated programming and to give the community a more comprehensive and fairer evaluation tool.
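The difference between pass@k and a key-point-aware metric can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes that pass@o reuses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), but counts a sample as correct only if it both passes the tests and contains the required key points. The helper names (`extract_key_points`, `matches_key_points`), the use of `ast` for matching, and the sample data are all hypothetical.

```python
import ast
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def extract_key_points(source: str) -> set:
    """Collect class names and private (underscore-prefixed) method names.

    Hypothetical matcher: dunder methods such as __init__ are not treated
    as private.
    """
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            names.add(node.name)
            for item in node.body:
                if not isinstance(item, ast.FunctionDef):
                    continue
                is_dunder = item.name.startswith("__") and item.name.endswith("__")
                if item.name.startswith("_") and not is_dunder:
                    names.add(item.name)
    return names


def matches_key_points(source: str, required: set) -> bool:
    """True if every required name appears in the generated program."""
    try:
        return required <= extract_key_points(source)
    except SyntaxError:
        return False


# Hypothetical task: "define class Account with a private method _check".
required = {"Account", "_check"}

# Each entry is (generated source, whether it passed the unit tests).
# The pass/fail flags are made up for illustration.
samples = [
    ("class Account:\n    def _check(self):\n        return True\n", True),
    ("class Account:\n    def check(self):\n        return True\n", True),
    ("class Account\n    def _check(self): pass\n", False),
    ("class Account:\n    def _check(self):\n        return False\n", False),
]

n = len(samples)
c = sum(1 for _, passed in samples if passed)
o = sum(1 for src, passed in samples if passed and matches_key_points(src, required))

print(f"pass@1 = {pass_at_k(n, c, 1):.2f}")
print(f"pass@o (k=1, assumed definition) = {pass_at_k(n, o, 1):.2f}")
```

In this made-up example, two of the four samples pass the tests but only one of them also defines the private method `_check`, so the key-point-aware score is lower than plain pass@1.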

OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models

JavaBench: A Benchmark of Object-Oriented Code Generation for Evaluating Large Language Models

Strategic Optimization and Challenges of Large Language Models in Object-Oriented Programming

CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models

A Picture Is Worth a Thousand Words: Exploring Diagram and Video-Based OOP Exercises to Counter LLM Over-Reliance

A Metrics-Based Comparative Study on Object-Oriented Programming Languages

Evaluating Large Language Models in Class-Level Code Generation

MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation

EvoCodeBench: An Evolving Code Generation Benchmark with Domain-Specific Evaluations

CoderUJB: An Executable and Unified Java Benchmark for Practical Programming Scenarios

Unseen Horizons: Unveiling the Real Capability of LLM Code Generation Beyond the Familiar

ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories

Planning-Driven Programming: A Large Language Model Programming Workflow

Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review

TESTEVAL: Benchmarking Large Language Models for Test Case Generation

From Code to Play: Benchmarking Program Search for Games Using Large Language Models

Evaluating Language Models for Generating and Judging Programming Feedback

ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation

What's Wrong with Your Code Generated by Large Language Models? An Extensive Study