Abstract: In recent years, researchers have proposed numerous benchmarks to evaluate the impressive coding capabilities of large language models (LLMs). However, current benchmarks primarily assess the accuracy of LLM-generated code, while neglecting other critical dimensions that also significantly impact code quality in real-world development. Moreover, relying exclusively on correctness as the guiding metric renders LLMs susceptible to data contamination. Therefore, this paper proposes the RACE benchmark, which comprehensively evaluates the quality of code generated by LLMs across 4 dimensions: Readability, mAintainability, Correctness, and Efficiency. Specifically, considering the demand-dependent nature of dimensions beyond correctness, we design various types of user requirements for each dimension to assess the model's ability to generate correct code that also meets user demands. We analyze 28 representative LLMs based on RACE and find that: 1) current correctness-centric benchmarks fail to capture the multifaceted requirements of code in real-world scenarios, while RACE provides a comprehensive evaluation that reveals the defects of LLMs across multiple dimensions; 2) the RACE benchmark serves as an effective tool for resisting the risk of data contamination; 3) even the most advanced code LLMs still encounter significant challenges in customized requirements involving complex instructions; 4) most LLMs exhibit an inherent preference for specific coding styles. These findings highlight the need for a multidimensional evaluation of code LLMs, emphasizing metrics beyond correctness for real-world applications. Future efforts should aim to develop novel learning algorithms to enhance code generation under varied constraints and improve coverage and usability for diverse user needs.

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

This paper addresses shortcomings in current benchmarks for evaluating code generation by large language models (LLMs). Existing benchmarks mainly assess whether generated code is correct and neglect other dimensions that significantly affect code quality in real-world development. The paper identifies three issues:

1. **Limitations of Correctness-Only Evaluation**: Existing benchmarks ignore readability, maintainability, and efficiency. A single correctness score cannot fully reflect how code performs in practical applications.
2. **Risk of Data Contamination**: Relying exclusively on correctness makes evaluation susceptible to data contamination. A model that has overfit its training data may produce code at inference time that closely matches that data, so a high score may reflect leakage rather than capability.
3. **Insufficient Support for Customization Needs**: Existing models struggle with complex instructions and specific user requirements, especially when customization spans multiple dimensions.

To address these issues, the paper proposes a new benchmark, RACE (Readability, mAintainability, Correctness, and Efficiency), which evaluates the quality of LLM-generated code along all four dimensions. Because the dimensions beyond correctness depend on what the user asks for, RACE defines several types of user requirements for each dimension and tests whether the model generates correct code that also meets them. An evaluation of 28 representative LLMs on RACE reveals their deficiencies across these dimensions.

The paper concludes that future research should develop new learning algorithms that improve code generation under varied constraints and increase coverage and usability for diverse user needs.
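The idea of scoring code on correctness together with a user requirement can be sketched as follows. This is a minimal illustration under stated assumptions, not the RACE implementation: the `Sample` structure, the `max_line_length` readability check, and the metric names are all hypothetical, and a real harness would run generated code in a sandbox rather than with a bare `exec`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Sample:
    """One generated solution plus what is needed to judge it (hypothetical)."""
    code: str                                # model-generated source
    entry_point: str                         # name of the function under test
    tests: List[Tuple[tuple, object]]        # (args, expected result) pairs
    requirement: Callable[[str], bool]       # checker for the user requirement


def is_correct(sample: Sample) -> bool:
    """Return True if the generated function passes every test case."""
    namespace: dict = {}
    try:
        # Assumption: trusted input. Untrusted code needs a sandbox.
        exec(sample.code, namespace)
        func = namespace[sample.entry_point]
        return all(func(*args) == expected for args, expected in sample.tests)
    except Exception:
        return False


def max_line_length(limit: int) -> Callable[[str], bool]:
    """Hypothetical readability requirement: no line longer than `limit`."""
    def check(code: str) -> bool:
        return all(len(line) <= limit for line in code.splitlines())
    return check


def score(samples: List[Sample]) -> Dict[str, float]:
    """Report plain correctness and correctness that also meets the requirement."""
    n = len(samples)
    correct = [is_correct(s) for s in samples]
    both = [c and s.requirement(s.code) for c, s in zip(correct, samples)]
    return {
        "correctness": sum(correct) / n,
        "correct_and_compliant": sum(both) / n,
    }
```

A gap between the two numbers indicates solutions that pass the tests but ignore the stated requirement, which a correctness-only benchmark would not detect.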

Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models
