Abstract: This study evaluates the efficiency of code generation by Large Language Models (LLMs) and measures their performance against human-crafted solutions using a dataset from Leetcode. We compare 18 LLMs, considering factors such as model temperature and success rate, and their impact on code performance. This research introduces a novel method for measuring and comparing the speed of LLM-generated code, revealing that LLMs produce code with comparable performance, irrespective of the adopted LLM. We also find that LLMs are capable of generating code that is, on average, more efficient than the code written by humans. The paper further discusses the use of Leetcode as a benchmarking dataset, the limitations imposed by potential data contamination, and the platform's measurement reliability. We believe that our findings contribute to a better understanding of LLM capabilities in code generation and set the stage for future optimizations in the field.

What problem does this paper attempt to address?

The problems that this paper attempts to solve can be summarized in the following points:

1. **The effectiveness of Leetcode as a dataset and benchmark platform**
   - Investigate whether Leetcode can serve as an effective dataset and evaluation platform for testing the code-generating ability of large language models (LLMs). Specifically, the researchers examine whether the Leetcode dataset suffers from data contamination (that is, whether the LLMs have seen these problems in their training data) and whether the performance metrics provided by Leetcode are reliable.
2. **Performance differences in code generated by different LLMs**
   - Compare the performance of the code generated by 18 different LLMs. The researchers evaluate these models by measuring the running time and memory usage of the generated code.
3. **The relationship between the performance of LLM-generated code and temperature and success rate**
   - Explore how the performance of LLM-generated code is affected by the model temperature and success rate. The temperature parameter affects the diversity of the generated code, while the success rate reflects the ability of an LLM to generate valid code.
4. **The efficiency of LLM-generated code compared with human-written code**
   - Compare the performance of LLM-generated code with that of human-written code to evaluate how LLMs actually perform in code-generation tasks.

### Specific research questions (RQs)

1. **RQ1**: Can Leetcode be used as a dataset and benchmark platform for evaluating LLMs?
   - Investigate whether the Leetcode dataset suffers from data contamination and whether the performance metrics provided by Leetcode are reliable.
2. **RQ2**: Are there significant performance differences in the code generated by different LLMs?
   - Compare the running time and memory usage of the code generated by different LLMs.
3. **RQ3**: How do the temperature and success rate of LLMs affect the performance of the generated code?
   - Explore the impact of the temperature parameter and success rate on the performance of LLM-generated code.
4. **RQ4**: How efficient is the code generated by LLMs compared with human-written code?
   - Compare the performance of LLM-generated code with that of human-written code.

### Method overview

To answer the above research questions, the researchers adopted the following methods:

- **Dataset selection**: Use programming problems on Leetcode as the dataset, comprising 204 newly released problems and a separate set of 300 old problems.
- **Model selection**: Select 18 LLMs designed specifically for code generation.
- **Code generation**: Generate multiple solutions by varying the temperature parameter of each LLM, and verify their correctness using Leetcode's online evaluation system.
- **Performance evaluation**: Use pytest-benchmark to measure the running time of the generated code, and evaluate its performance through Leetcode's ranking system.

### Main findings

- **The effectiveness of the Leetcode dataset**: The old dataset shows serious data contamination, while the new dataset does not, indicating that Leetcode can be used as an effective evaluation platform when recently released problems are used.
- **The performance of code generated by LLMs**: The code generated by different LLMs shows comparable performance, irrespective of the model used.
- **The influence of temperature and success rate**: The temperature parameter has some influence on the performance of the generated code, but it is not a decisive factor. LLMs with a high success rate usually generate code with better performance.
- **Comparison with human code**: The code generated by LLMs is, on average, more efficient than human-written code, which matters most when resources are limited or deployment is large-scale.

These findings help to better understand the capabilities and limitations of LLMs in code-generation tasks and provide directions for future optimization.

A Performance Study of LLM-Generated Code on Leetcode
