EffiBench: Benchmarking the Efficiency of Automatically Generated Code

Dong Huang, Yuhao Qing, Weiyi Shang, Heming Cui, Jie M. Zhang
2024-10-06
Abstract: Code generation models have increasingly become integral to aiding software development. Although current research has thoroughly examined the correctness of the code produced by code generation models, the efficiency of that code, a vital aspect that plays a pivotal role in green computing and sustainability efforts, has often been neglected. This paper presents EffiBench, a benchmark with 1,000 efficiency-critical coding problems to assess the efficiency of code generated by code generation models. EffiBench contains a diverse set of LeetCode coding problems. Each problem is paired with an executable human-written canonical solution, which obtains state-of-the-art efficiency on the LeetCode solution leaderboard. With EffiBench, we empirically examine the ability of 42 large language models (35 open-source and 7 closed-source) to generate efficient code. Our evaluation results demonstrate that the efficiency of the code generated by LLMs is generally worse than the efficiency of human-written canonical solutions. For example, the average execution time of GPT-4 generated code is 3.12 times that of the human-written canonical solutions. In the most extreme cases, the execution time and total memory usage of GPT-4 generated code are 13.89 and 43.92 times that of the canonical solutions, respectively. The source code of EffiBench is released at https://github.com/huangd1999/EffiBench. We also provide the leaderboard at https://huggingface.co/spaces/EffiBench/effibench-leaderboard.
Software Engineering, Computation and Language
What problem does this paper attempt to address?
The paper addresses a gap in how code generation models are evaluated: the correctness of generated code has been widely studied, but its efficiency is often overlooked. The paper argues that code efficiency matters for green computing and sustainable development, especially in resource-constrained environments such as mobile devices or embedded systems. It therefore proposes a benchmark named EffiBench, which evaluates the efficiency of code produced by automatic code generation models. EffiBench contains 1,000 efficiency-critical programming problems, each paired with an executable human-written canonical solution that achieves the best time and space efficiency scores on the LeetCode solution leaderboard.
Using EffiBench, the authors conducted an empirical study of 42 large language models (35 open-source and 7 closed-source) to evaluate their ability to generate efficient code. The results show that code generated by large language models is generally less efficient than the human-written canonical solutions. For example, the average execution time of code generated by GPT-4 is 3.12 times that of the canonical solutions; in the most extreme cases, the execution time and total memory usage of GPT-4 generated code are 13.89 times and 43.92 times those of the canonical solutions, respectively.
The paper also explores the relationship between correctness and efficiency, and finds that a high pass@1 score (the rate at which the model generates correct code on the first attempt) does not necessarily imply higher code efficiency. For example, GPT-4-turbo-preview has a higher pass@1 score than GPT-4, yet its code is less efficient than GPT-4's.
In conclusion, by proposing EffiBench, the paper fills a gap in existing research on evaluating the efficiency of code generation models and provides a benchmark and tool for future research.
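The "times that of the canonical solution" figures quoted above are ratios of a generated solution's cost to the canonical solution's cost. The following is a minimal sketch of how such ratios might be aggregated, not the benchmark's actual implementation. It assumes the reported average is a mean of per-problem ratios, which this summary does not specify; the function names and the timing values are hypothetical and chosen only for illustration.

```python
from statistics import mean


def normalized_ratios(generated, canonical):
    """Return the per-problem ratio of generated cost to canonical cost.

    Both arguments are equal-length sequences of measurements
    (for example, execution time in seconds) for the same problems.
    """
    if len(generated) != len(canonical):
        raise ValueError("both sequences must cover the same problems")
    return [g / c for g, c in zip(generated, canonical)]


def summarize(ratios):
    """Aggregate ratios into an average and a worst case (assumed aggregation)."""
    return {"mean": mean(ratios), "worst": max(ratios)}


# Hypothetical execution times in seconds; these are NOT results from the paper.
generated_times = [0.75, 0.5, 1.5]
canonical_times = [0.25, 0.25, 0.25]

ratios = normalized_ratios(generated_times, canonical_times)
summary = summarize(ratios)
print(summary)
```

A ratio above 1 means the generated code is slower (or uses more memory) than the canonical solution. The same two functions apply unchanged to memory measurements, since only the ratio is computed.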