Abstract: Recent years have seen the remarkable capabilities of large language models (LLMs) for code generation. Different from existing work that evaluates the correctness of the code generated by LLMs, we propose to further evaluate its efficiency. More efficient code can lead to higher performance and execution efficiency of programs and software completed by LLM-assisted programming. First, we evaluate the efficiency of the code generated by LLMs on two benchmarks, HumanEval and MBPP. Then, we choose a set of programming problems from the online judge platform LeetCode to conduct a more difficult evaluation. Finally, we explore several prompts that would enable LLMs to generate more efficient code.

What problem does this paper attempt to address?

The paper attempts to address the issue of evaluating the efficiency of code generated by large language models (LLMs). While existing work primarily focuses on the correctness of the code generated by LLMs, this paper further explores the execution efficiency of such code. Specifically, the authors pose the following research questions:

1. **RQ1**: How efficient is the code generated by LLMs?
2. **RQ2**: How can prompts be used to make LLMs generate more efficient code?

To answer these questions, the authors conducted the following work:

1. **Dataset Selection**:
   - Used two introductory programming benchmark datasets: HumanEval and MBPP.
   - Constructed a new benchmark dataset based on the LeetCode platform, named LeetCodeEval, which includes programming problems of varying difficulty levels.
2. **Experimental Design**:
   - On HumanEval and MBPP, evaluated the efficiency of the generated code by running it and measuring its execution time.
   - On LeetCodeEval, submitted the generated C++ code to the LeetCode platform to obtain its correctness and runtime.
3. **Model Selection**:
   - Selected multiple commercial and open-source LLMs, including GPT-3.5, GPT-4, Phi-2, CodeLlama, WizardCoder, and DeepSeek Coder.
4. **Evaluation Metrics**:
   - Reported the average normalized runtime and Pass@10 metrics.
   - For LeetCodeEval, also used the average beat percentage (i.e., the percentage of other users' code that each accepted code segment outperformed).
5. **Results Analysis**:
   - Found that the ability to generate correct code is not always positively correlated with the ability to generate efficient code.
   - The number of model parameters does not necessarily guarantee higher performance.
   - Training strategies and data have a significant impact on the efficiency of the generated code.
   - Performance differences across different benchmark datasets may be related to the model's data distribution and dataset characteristics.
6. **Prompt Methods**:
   - Tried three different prompt methods to explore how to make LLMs generate more efficient code.
   - Results showed that step-by-step prompting methods work better on complex problems.

Overall, through systematic experiments and analysis, the paper reveals the efficiency issues of code generated by LLMs and proposes some improvement methods, providing valuable references for future research.
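The two headline metrics in the summary above, Pass@10 and average normalized runtime, can be illustrated with a minimal Python sketch. This is not the authors' evaluation harness, and it rests on several assumptions. `pass_at_k` uses the commonly cited unbiased estimator 1 - C(n-c, k) / C(n, k); the summary does not state which estimator the paper uses. `normalized_runtimes` divides by the fastest runtime on a problem, which is one plausible normalization rather than the paper's confirmed definition. The function names and the timing parameters (`repeats`, `number`) are hypothetical.

```python
import math
import timeit


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Assumption: the unbiased estimator 1 - C(n-c, k) / C(n, k).
    """
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def measure_runtime(func, args, repeats: int = 5, number: int = 100) -> float:
    """Return the best per-call wall-clock time (seconds) over several runs."""
    timer = timeit.Timer(lambda: func(*args))
    # Taking the minimum reduces noise from other processes on the machine.
    return min(timer.repeat(repeat=repeats, number=number)) / number


def normalized_runtimes(runtimes: dict[str, float]) -> dict[str, float]:
    """Scale each runtime by the fastest one for the same problem.

    Assumption: the fastest solution is the baseline (value 1.0).
    """
    fastest = min(runtimes.values())
    return {name: t / fastest for name, t in runtimes.items()}


# Two hypothetical generated solutions to the same toy problem.
def sum_loop(n: int) -> int:
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_formula(n: int) -> int:
    return n * (n + 1) // 2


if __name__ == "__main__":
    times = {
        "loop": measure_runtime(sum_loop, (10_000,)),
        "formula": measure_runtime(sum_formula, (10_000,)),
    }
    print(normalized_runtimes(times))
    print(pass_at_k(n=20, c=5, k=10))
```

Both toy solutions are correct, so they would score the same on Pass@k, yet their normalized runtimes differ. That is the gap between correctness and efficiency the paper sets out to measure.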

On Evaluating the Efficiency of Source Code Generated by LLMs
