On Evaluating the Efficiency of Source Code Generated by LLMs

Changan Niu, Ting Zhang, Chuanyi Li, Bin Luo, Vincent Ng
2024-04-09
Abstract: Recent years have seen the remarkable capabilities of large language models (LLMs) for code generation. Different from existing work that evaluates the correctness of the code generated by LLMs, we propose to further evaluate its efficiency. More efficient code can lead to higher performance and execution efficiency of programs and software completed by LLM-assisted programming. First, we evaluate the efficiency of the code generated by LLMs on two benchmarks, HumanEval and MBPP. Then, we choose a set of programming problems from the online judge platform LeetCode to conduct a more difficult evaluation. Finally, we explore several prompts that would enable LLMs to generate more efficient code.
Software Engineering
What problem does this paper attempt to address?
The paper attempts to address the issue of evaluating the efficiency of code generated by large language models (LLMs). While existing work primarily focuses on the correctness of the code generated by LLMs, this paper further explores the execution efficiency of such code. Specifically, the authors pose the following research questions:

1. **RQ1**: How efficient is the code generated by LLMs?
2. **RQ2**: How can prompts be used to make LLMs generate more efficient code?

To answer these questions, the authors conducted the following work:

1. **Dataset Selection**:
   - Used two introductory programming benchmark datasets: HumanEval and MBPP.
   - Constructed a new benchmark dataset based on the LeetCode platform, named LeetCodeEval, which includes programming problems of varying difficulty levels.
2. **Experimental Design**:
   - On HumanEval and MBPP, evaluated the efficiency of the generated code by running it and measuring its execution time.
   - On LeetCodeEval, submitted the generated C++ code to the LeetCode platform to obtain its correctness and runtime.
3. **Model Selection**:
   - Selected multiple commercial and open-source LLMs, including GPT-3.5, GPT-4, Phi-2, CodeLlama, WizardCoder, and DeepSeek Coder.
4. **Evaluation Metrics**:
   - Reported the average normalized runtime and Pass@10 metrics.
   - For LeetCodeEval, also used the average beat percentage (i.e., the percentage of other users' code that each accepted code segment outperformed).
5. **Results Analysis**:
   - Found that the ability to generate correct code is not always positively correlated with the ability to generate efficient code.
   - The number of model parameters does not necessarily guarantee higher performance.
   - Training strategies and data have a significant impact on the efficiency of the generated code.
   - Performance differences across different benchmark datasets may be related to the model's data distribution and dataset characteristics.
6. **Prompt Methods**:
   - Tried three different prompt methods to explore how to make LLMs generate more efficient code.
   - Results showed that step-by-step prompting methods work better on complex problems.

Overall, through systematic experiments and analysis, the paper reveals the efficiency issues of code generated by LLMs and proposes some improvement methods, providing valuable references for future research.
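The evaluation metrics listed above can be illustrated with a minimal Python sketch. This summary does not say how the paper computes Pass@10 or how it normalizes runtime, so both functions below are assumptions: `pass_at_k` uses the widely adopted unbiased estimator 1 - C(n-c, k) / C(n, k), and `normalized_runtime` (a hypothetical helper) simply divides a candidate's best measured time by that of a reference solution. The function and parameter names are illustrative and do not come from the paper.

```python
import math
import timeit


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled solutions, c of which are correct.

    Assumption: this is the common estimator 1 - C(n-c, k) / C(n, k);
    the paper may compute Pass@10 differently.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a correct one.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def normalized_runtime(candidate, reference, repeat: int = 5, number: int = 100) -> float:
    """Best-of-`repeat` time of `candidate` divided by that of `reference`.

    Assumption: normalizing against a reference solution is one plausible
    scheme, not necessarily the paper's. Values below 1.0 mean the
    candidate ran faster than the reference.
    """
    t_candidate = min(timeit.repeat(candidate, repeat=repeat, number=number))
    t_reference = min(timeit.repeat(reference, repeat=repeat, number=number))
    return t_candidate / t_reference


def slow_sum():
    # Deliberately inefficient: explicit Python-level loop.
    total = 0
    for i in range(1000):
        total += i
    return total


def fast_sum():
    # Same result via the built-in, which runs the loop in C.
    return sum(range(1000))


if __name__ == "__main__":
    print(pass_at_k(n=20, c=5, k=10))
    print(normalized_runtime(slow_sum, fast_sum))
```

Taking the minimum over several repeats reduces the effect of scheduling noise on the timing. Both example functions return the same value, so any ratio other than 1.0 reflects efficiency rather than correctness, which is the distinction the paper draws.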