Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models

Jiasheng Zheng, Boxi Cao, Zhengzhao Ma, Ruotong Pan, Hongyu Lin, Yaojie Lu, Xianpei Han, Le Sun
2024-10-09
Abstract: In recent years, researchers have proposed numerous benchmarks to evaluate the impressive coding capabilities of large language models (LLMs). However, current benchmarks primarily assess the accuracy of LLM-generated code, while neglecting other critical dimensions that also significantly impact code quality in real-world development. Moreover, relying exclusively on correctness as the guiding metric renders LLMs susceptible to data contamination. Therefore, this paper proposes the RACE benchmark, which comprehensively evaluates the quality of code generated by LLMs across 4 dimensions: Readability, mAintainability, Correctness, and Efficiency. Specifically, considering the demand-dependent nature of dimensions beyond correctness, we design various types of user requirements for each dimension to assess the model's ability to generate correct code that also meets user demands. We analyze 28 representative LLMs based on RACE and find that: 1) current correctness-centric benchmarks fail to capture the multifaceted requirements of code in real-world scenarios, while RACE provides a comprehensive evaluation that reveals the defects of LLMs across multiple dimensions; 2) the RACE benchmark serves as an effective tool for resisting the risk of data contamination; 3) even the most advanced code LLMs still encounter significant challenges in customized requirements involving complex instructions; 4) most LLMs exhibit an inherent preference for a specific coding style. These findings highlight the need for a multidimensional evaluation of code LLMs, emphasizing metrics beyond correctness for real-world applications. Future efforts should aim to develop novel learning algorithms to enhance code generation under varied constraints and improve coverage and usability for diverse user needs.
Software Engineering, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

This paper addresses several shortcomings of current benchmarks for evaluating code generated by large language models (LLMs). These benchmarks mainly assess the correctness of generated code while neglecting other dimensions that significantly affect code quality in real-world development. The specific issues are:

1. **Limitations of Correctness-Only Evaluation**: Existing benchmarks ignore other important dimensions such as readability, maintainability, and efficiency. A single correctness score cannot fully reflect how code performs in practical applications.
2. **Risk of Data Contamination**: Relying exclusively on correctness as the evaluation metric leaves LLMs susceptible to data contamination: a model that has overfit its training data may generate code at inference time that closely mirrors that data, which amounts to data leakage.
3. **Insufficient Support for Customization Needs**: Existing models face significant challenges in handling complex instructions and meeting specific user requirements, especially when customization spans multiple dimensions.

To address these issues, the paper proposes a new benchmark, RACE (Readability, mAintainability, Correctness, and Efficiency), which evaluates the quality of LLM-generated code along all four dimensions. For the dimensions beyond correctness, RACE defines various types of user requirements and tests whether a model can generate code that is both correct and compliant with those requirements.

The evaluation on RACE reveals the deficiencies of current LLMs across these dimensions and underlines the need for new learning algorithms that improve code generation. Future research should focus on improving the code generation capabilities of LLMs under different constraints and on increasing their coverage and usability for diverse user needs.
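The idea of scoring code on correctness together with a user requirement can be illustrated with a minimal Python sketch. This is not the RACE implementation: the snake_case naming rule is a hypothetical stand-in for a readability requirement, and the helper names `is_correct`, `follows_snake_case`, and `evaluate` are assumptions introduced here for illustration only.

```python
import ast
import re

# Hypothetical readability requirement: identifiers must be snake_case.
SNAKE_CASE = re.compile(r"[a-z_][a-z0-9_]*")


def is_correct(code, func_name, test_cases):
    """Execute the candidate and compare its outputs with the expected values."""
    namespace = {}
    try:
        # NOTE: exec on untrusted model output needs a sandbox in real use.
        exec(code, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        return False


def follows_snake_case(code):
    """Check that function, argument, and assigned variable names are snake_case."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    names = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            names.append(node.name)
        elif isinstance(node, ast.arg):
            names.append(node.arg)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            names.append(node.id)
    return all(SNAKE_CASE.fullmatch(name) for name in names)


def evaluate(code, func_name, test_cases):
    """Report correctness, requirement compliance, and their conjunction."""
    correct = is_correct(code, func_name, test_cases)
    meets = follows_snake_case(code)
    return {"correct": correct, "meets_requirement": meets, "both": correct and meets}


if __name__ == "__main__":
    cases = [((1, 2), 3), ((0, 0), 0)]
    camel = "def add(firstValue, secondValue):\n    return firstValue + secondValue\n"
    print(evaluate(camel, "add", cases))
```

In this sketch, a candidate that passes every test but uses camelCase names counts as correct yet fails the combined check, which is the kind of gap a correctness-only score would not reveal.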