Abstract: Code generation models have become increasingly integral to software development. Although current research has thoroughly examined the correctness of the code these models produce, the efficiency of that code, which plays a pivotal role in green computing and sustainability efforts, has often been neglected. This paper presents EffiBench, a benchmark of 1,000 efficiency-critical coding problems for assessing the efficiency of code generated by code generation models. EffiBench contains a diverse set of LeetCode coding problems. Each problem is paired with an executable human-written canonical solution that achieves state-of-the-art efficiency on the LeetCode solution leaderboard. With EffiBench, we empirically examine the ability of 42 large language models (35 open-source and 7 closed-source) to generate efficient code. Our evaluation results demonstrate that code generated by LLMs is generally less efficient than the human-written canonical solutions. For example, GPT-4-generated code has an average execution time 3.12 times that of the human-written canonical solutions. In the most extreme cases, the execution time and total memory usage of GPT-4-generated code are 13.89 and 43.92 times those of the canonical solutions, respectively. The source code of EffiBench is released at https://github.com/huangd1999/EffiBench. We also provide the leaderboard at https://huggingface.co/spaces/EffiBench/effibench-leaderboard.

What problem does this paper attempt to address?

The paper addresses a gap in how code-generation models are evaluated: the correctness of generated code has been widely studied, but its efficiency is often overlooked. The paper points out that code efficiency matters for green computing and sustainability, especially in resource-constrained environments such as mobile devices or embedded systems. It therefore proposes a benchmark named EffiBench, which aims to evaluate the efficiency of code produced by automatic code-generation models. EffiBench contains 1,000 efficiency-critical programming problems, each paired with an executable human-written canonical solution. These solutions obtained the best time and space efficiency scores on the LeetCode solution leaderboard.

Using EffiBench, the authors conducted an empirical study of 42 large language models (35 open-source and 7 closed-source) to evaluate their ability to generate efficient code. The results show that code generated by large language models is generally less efficient than the human-written canonical solutions. For example, the average execution time of GPT-4-generated code is 3.12 times that of the canonical solutions; in the most extreme cases, its execution time and total memory usage are 13.89 times and 43.92 times those of the canonical solutions, respectively. The paper also explores the relationship between correctness and efficiency, and finds that a high pass@1 score (that is, the model's ability to generate correct code on the first attempt) does not necessarily mean higher code efficiency. For example, although GPT-4-turbo-preview has a higher pass@1 score than GPT-4, its code is less efficient than GPT-4's.

In conclusion, by proposing EffiBench, the paper fills a gap in existing research on evaluating the efficiency of code-generation models and provides a benchmark and tool for future research.
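The "times that of the canonical solution" figures above are ratios of a generated solution's cost to the canonical solution's cost on the same problem. The following is a minimal sketch of how such ratios could be computed, not EffiBench's actual harness: the helper names `profile` and `efficiency_ratios`, the two toy solutions, and the use of `time.perf_counter` and `tracemalloc` as measurement tools are all assumptions made for illustration.

```python
import time
import tracemalloc
from statistics import mean


def profile(func, *args, repeats=5):
    """Return (best wall-clock seconds, peak traced bytes) for func(*args).

    Hypothetical helper: timing and memory are measured in separate runs
    so that tracemalloc's overhead does not distort the timing.
    """
    best_time = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best_time = min(best_time, time.perf_counter() - start)
    tracemalloc.start()
    func(*args)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return best_time, peak


def efficiency_ratios(generated, canonical):
    """Given per-problem costs (all positive), return (mean ratio, max ratio)
    of generated cost to canonical cost."""
    ratios = [g / c for g, c in zip(generated, canonical)]
    return mean(ratios), max(ratios)


# Toy stand-ins (assumed, not taken from the benchmark): a quadratic
# "generated" solution and a linear hash-map "canonical" solution.
def two_sum_quadratic(nums, target):
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return (i, j)
    return None


def two_sum_hashmap(nums, target):
    seen = {}
    for j, x in enumerate(nums):
        if target - x in seen:
            return (seen[target - x], j)
        seen[x] = j
    return None


nums, target = list(range(200)), 397
t_gen, m_gen = profile(two_sum_quadratic, nums, target)
t_can, m_can = profile(two_sum_hashmap, nums, target)
print(f"execution-time ratio: {t_gen / t_can:.2f}")
print(f"peak memory (bytes): generated={m_gen}, canonical={m_can}")
```

Under these assumptions, averaging the per-problem ratios over a problem set gives a figure comparable in spirit to the 3.12 average, and taking the maximum gives the worst case. Comparing only problems that the generated code solves correctly keeps efficiency separate from pass@1, which is why the two can rank models differently.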

EffiBench: Benchmarking the Efficiency of Automatically Generated Code
