Abstract: This paper provides a comprehensive review of the current methods and metrics used to evaluate the performance of Large Language Models (LLMs) in code generation tasks. With the rapid growth in demand for automated software development, LLMs have demonstrated significant potential in the field of code generation. The paper begins by reviewing the historical development of LLMs and their applications in code generation. Next, it details various methods and metrics for assessing the code generation capabilities of LLMs, including code correctness, efficiency, readability, and evaluation methods based on expert review and user experience. The paper also evaluates the widely used benchmark datasets, identifying their limitations and proposing directions for future improvements. Specifically, the paper analyzes the performance of code generation models across different tasks by combining multiple evaluation metrics, such as code compilation/interpretation success rates, unit test pass rates, and performance and efficiency metrics, to comprehensively assess the practical application of LLMs in code generation. Finally, the paper discusses the challenges faced in evaluating LLMs in code generation, particularly how to ensure the comprehensiveness and accuracy of evaluation methods and how to adapt to the evolving practices of software development. These analyses and discussions provide valuable insights for further optimizing and improving the application of LLMs in code generation tasks.

What problem does this paper attempt to address?

The problem this paper attempts to solve is how to comprehensively and systematically evaluate the performance of large language models (LLMs) in code generation tasks. With the rapid growth in demand for automated software development, large language models have shown great potential in the field of code generation. However, there is currently no comprehensive and systematic evaluation method for measuring the performance of these models in practical applications. Specifically, the paper focuses on the following aspects:

1. **Historical Development and Application**:
   - Review the historical development of large language models and their applications in code generation to understand their evolution and existing capabilities.
2. **Evaluation Methods and Metrics**:
   - Discuss in detail the methods and metrics used to evaluate code generation models, including code correctness, efficiency, and readability.
   - Analyze evaluation methods based on expert review and user experience to ensure the comprehensiveness and accuracy of the evaluation.
3. **Limitations of Benchmark Datasets**:
   - Evaluate widely used benchmark datasets (such as HumanEval, MBPP, and CodeXGLUE), identify their limitations, and suggest improvements.
4. **Multi-dimensional Evaluation**:
   - Combine multiple evaluation metrics (such as compilation/interpretation success rate, unit test pass rate, and performance and efficiency metrics) to comprehensively evaluate the performance of large language models across different tasks.
5. **Challenges and Future Directions**:
   - Discuss the challenges faced when evaluating large language models in code generation, such as ensuring the comprehensiveness and accuracy of evaluation methods and adapting to ever-changing software development practices.
   - Outline future research directions, including scalability, multi-language generalization ability, security, and robustness.

By addressing these problems, the paper aims to provide valuable insights for researchers and practitioners, thereby further optimizing and improving the application of large language models in code generation tasks.
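To make two of the metrics named above concrete, the following is a minimal sketch of how an interpretation success rate and a unit test pass rate might be computed for Python candidates. It is an illustration only, not the paper's evaluation harness: the function names, the sample candidates, and the test string are all hypothetical, and a real harness would run untrusted generated code in a sandbox with timeouts rather than calling `exec` directly.

```python
from typing import Dict, List


def interpretation_success_rate(candidates: List[str]) -> float:
    """Fraction of candidates that Python can parse and byte-compile."""
    ok = 0
    for src in candidates:
        try:
            compile(src, "<candidate>", "exec")
            ok += 1
        except SyntaxError:
            pass
    return ok / len(candidates) if candidates else 0.0


def unit_test_pass_rate(candidates: List[str], test_src: str) -> float:
    """Fraction of candidates for which every assertion in test_src holds."""
    passed = 0
    for src in candidates:
        namespace: Dict[str, object] = {}
        try:
            exec(src, namespace)       # define the candidate's functions
            exec(test_src, namespace)  # run the assertions against them
            passed += 1
        except Exception:
            # Syntax errors, runtime errors, and failed assertions all count as failures.
            pass
    return passed / len(candidates) if candidates else 0.0


# Hypothetical model outputs for the prompt "write add(a, b)".
candidates = [
    "def add(a, b):\n    return a + b\n",  # correct
    "def add(a, b):\n    return a - b\n",  # compiles, but fails the tests
    "def add(a, b)\n    return a + b\n",   # syntax error (missing colon)
]
tests = "assert add(1, 2) == 3\nassert add(-1, 1) == 0\n"

print("interpretation success rate:", interpretation_success_rate(candidates))
print("unit test pass rate:", unit_test_pass_rate(candidates, tests))
```

In this toy set, two of the three candidates compile but only one passes the tests, which shows why the two metrics are reported separately: a high compilation rate does not imply functional correctness.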

A Survey on Evaluating Large Language Models in Code Generation Tasks

A Survey on Large Language Models for Code Generation

Evaluating Large Language Models in Class-Level Code Generation

A Survey on Evaluation of Large Language Models

A Survey of Large Language Models for Code: Evolution, Benchmarking, and Future Trends

A Review on Code Generation with LLMs: Application and Evaluation

CodeJudge: Evaluating Code Generation with Large Language Models

Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review

On the Evaluation of Large Language Models in Unit Test Generation

Where Do Large Language Models Fail When Generating Code?

What's Wrong with Your Code Generated by Large Language Models? An Extensive Study

On the Effectiveness of Large Language Models in Domain-Specific Code Generation

L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models

VerilogEval: Evaluating Large Language Models for Verilog Code Generation

Leveraging Large Language Models for NLG Evaluation: Advances and Challenges

Large Language Models for Code Analysis: Do LLMs Really Do Their Job?

Evaluating Large Language Models: A Comprehensive Survey

TESTEVAL: Benchmarking Large Language Models for Test Case Generation

A Systematic Evaluation of Large Language Models for Natural Language Generation Tasks