OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models

Shuai Wang, Liang Ding, Li Shen, Yong Luo, Bo Du, Dacheng Tao
2024-02-21
Abstract: Advancing automated programming necessitates robust and comprehensive code generation benchmarks, yet current evaluation frameworks largely neglect object-oriented programming (OOP) in favor of functional programming (FP), e.g., HumanEval and MBPP. To address this, our study introduces a pioneering OOP-focused benchmark, featuring 431 Python programs that encompass essential OOP concepts and features like classes and encapsulation methods. We propose a novel evaluation metric, pass@o, tailored for OOP, enhancing traditional pass@k measures. Our evaluation of 23 leading large language models (LLMs), including both general and code-specialized models, reveals three key insights: 1) pass@o offers a more relevant and comprehensive assessment for OOP code generation; 2) Despite excelling in FP, code-specialized LLMs like WizardCoder lag in OOP compared to models like ChatGPT; 3) The poor performance of all advanced LLMs on our OOP benchmark highlights a critical need for improvements in this field. Our benchmark and scripts are publicly released at:
Computation and Language
What problem does this paper attempt to address?
This paper addresses two related problems: current large language models (LLMs) perform poorly on object-oriented programming (OOP) tasks, and existing code generation benchmarks focus on functional programming (FP) while leaving OOP concepts and features unevaluated. Specifically:

1. **Lack of OOP evaluation benchmarks**: Existing code generation benchmarks such as HumanEval and MBPP focus mainly on functional programming and do not cover the key concepts and features of object-oriented programming, such as classes, inheritance, and encapsulation methods. As a result, a model that scores well on these benchmarks may still perform poorly on actual OOP tasks.
2. **Limitations of evaluation metrics**: Existing metrics such as pass@k mainly evaluate whether the generated code executes correctly. They do not show whether the model actually produced the OOP elements the task asked for, such as the required class names and private function names.

To solve these problems, the paper proposes the following:

1. **Construct an OOP evaluation benchmark**: The paper builds a benchmark of 431 Python programs covering key OOP concepts and features, such as classes, inheritance, and encapsulation methods.
2. **Propose a new evaluation metric, pass@o**: To evaluate OOP code generation more comprehensively, the paper proposes pass@o, a refinement of the traditional pass@k. It assesses a model's OOP ability by matching the key points in the natural-language description against the key points in the generated program (such as class names and private function names).
3. **Broadly evaluate existing LLMs**: The paper evaluates 23 mainstream large language models, including both general-purpose and code-specialized models, exposes their shortcomings on OOP tasks, and points out directions for improvement.

Through these measures, the paper aims to advance research on object-oriented programming in automated programming and to give the community a more comprehensive and fairer evaluation tool.
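
The relationship between pass@k and pass@o can be sketched in a few lines of Python. The summary above does not give the formula for pass@o, so this is only an illustration under stated assumptions: it reuses the standard unbiased pass@k estimator, 1 - C(n - c, k) / C(n, k), and assumes that pass@o replaces the count of correct samples c with the count of samples that both pass their tests and define the required class and method names. The helper names `matches_key_points` and `pass_at_o`, and the use of Python's `ast` module for matching, are hypothetical choices made for this example, not the paper's released scripts.

```python
import ast
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def matches_key_points(code: str, class_names: set, method_names: set) -> bool:
    """Hypothetical check: are all required class and method names defined?"""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    found_classes, found_methods = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            found_classes.add(node.name)
            # Collect methods defined directly in the class body.
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    found_methods.add(item.name)
    return class_names <= found_classes and method_names <= found_methods


def pass_at_o(samples, class_names: set, method_names: set, k: int) -> float:
    """Assumed pass@o: pass@k counting only samples that pass AND match.

    `samples` is a list of (code, passed_tests) pairs for one task.
    """
    n = len(samples)
    o = sum(
        1
        for code, passed in samples
        if passed and matches_key_points(code, class_names, method_names)
    )
    return pass_at_k(n, o, k)


# An OOP-style solution with a private method, and a purely functional one.
oop_solution = (
    "class Account:\n"
    "    def __init__(self):\n"
    "        self.__balance = 0\n"
    "    def __audit(self):\n"
    "        return self.__balance\n"
)
flat_solution = "def audit(balance):\n    return balance\n"

samples = [
    (oop_solution, True),
    (flat_solution, True),   # passes tests but ignores the requested class
    (flat_solution, False),
    (oop_solution, False),
]

print(pass_at_k(len(samples), 2, 1))                      # 0.5
print(pass_at_o(samples, {"Account"}, {"__audit"}, k=1))  # 0.25
```

In this toy case two of four samples pass their tests, but only one of them defines the requested class and private method, so the assumed pass@o score (0.25) is lower than pass@k (0.5). That gap is the kind of difference the metric is meant to expose.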