Abstract: Large Language Models (LLMs) have shown great potential in code generation. However, current LLMs still cannot reliably generate correct code. Moreover, it is unclear what kinds of code generation errors LLMs can make. To address this, we conducted an empirical study to analyze incorrect code snippets generated by six popular LLMs on the HumanEval dataset. We analyzed these errors along two dimensions of error characteristics, semantic and syntactic, to derive a comprehensive code generation error taxonomy for LLMs through open coding and thematic analysis. We then labeled all 557 incorrect code snippets based on this taxonomy. Our results showed that the six LLMs exhibited similar distributions of syntactic characteristics but different distributions of semantic characteristics. Furthermore, we analyzed the correlation between different error characteristics and factors such as task complexity, code length, and test-pass rate. Finally, we highlight the challenges that LLMs may encounter when generating code and propose implications for future research on reliable code generation with LLMs.

What problem does this paper attempt to address?

### The Problems the Paper Attempts to Solve

This paper explores the failure cases of large language models (LLMs) in code generation. Despite their significant potential, LLMs still fail to reliably produce correct code, and it is unclear what types of errors they make when generating it. To fill this knowledge gap, the authors conducted an empirical study of the incorrect code snippets generated by six popular LLMs on the HumanEval dataset. Specifically, the paper attempts to answer the following questions:

1. **What types of code errors are generated by different LLMs?** This question aims to reveal the common characteristics of, and differences between, the code errors made by different LLMs. The answer helps researchers understand whether general methods can be developed to improve LLMs or whether individual models require specialized handling.
2. **How much effort is needed to fix code generation errors?** In practice, it is unrealistic to expect LLMs to generate completely correct code in all cases. Existing research suggests that some incorrect code can still serve as a useful starting point for developers. It is therefore important to understand the effort required to fix incorrect solutions and the feasibility of automated fixes.
3. **How does task complexity affect LLMs' code generation?** Intuitively, complex tasks are harder to solve than simple ones. However, it is unclear whether different LLMs differ in their code generation capabilities across tasks of different complexity. Knowing the upper limit of task complexity that LLMs can handle gracefully can help estimate the effort needed for code review, testing, and fixing.
4. **What is the relationship between code length and the types of errors generated by LLMs?** Unlike the third question, this question investigates whether the length of the generated code affects the correctness of the code snippets. It focuses in particular on longer code solutions and their error characteristics.
5. **Does partially failing code have different characteristics from completely failing code?** This question explores the differences between code that fails some test cases and code that fails all test cases, providing insight into the specific challenges of achieving complete correctness.

By answering these questions, the paper aims to provide insights and suggestions for future research on code generation with LLMs.
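The distinction between partially and completely failing code in the fifth question can be illustrated with a short sketch. This is not the authors' evaluation harness: the test-case format, the helper names `pass_rate` and `classify`, and the buggy example function are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# Assumed format: each test case is an (arguments, expected result) pair.
TestCase = Tuple[tuple, object]


def pass_rate(candidate: Callable, tests: List[TestCase]) -> float:
    """Return the fraction of test cases that the candidate passes."""
    if not tests:
        raise ValueError("at least one test case is required")
    passed = 0
    for args, expected in tests:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            # A runtime error counts as a failed test case.
            pass
    return passed / len(tests)


def classify(rate: float) -> str:
    """Map a test-pass rate to the categories used in the summary above."""
    if rate == 1.0:
        return "correct"
    if rate == 0.0:
        return "completely failing"
    return "partially failing"


# Hypothetical generated solution: meant to return the absolute value,
# but it returns negative inputs unchanged.
def buggy_abs(x: int) -> int:
    return x


tests = [((3,), 3), ((0,), 0), ((-2,), 2), ((-5,), 5)]
rate = pass_rate(buggy_abs, tests)
label = classify(rate)
```

Here `buggy_abs` passes the two non-negative cases and fails the two negative ones, so it falls into the partially failing category rather than the completely failing one.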

Where Do Large Language Models Fail When Generating Code?

An Empirical Study of Code Generation Errors made by Large Language Models

A Deep Dive into Large Language Model Code Generation Mistakes: What and Why?

What's Wrong with Your Code Generated by Large Language Models? An Extensive Study

Do Large Language Models Pay Similar Attention Like Human Programmers When Generating Code?

Fixing Code Generation Errors for Large Language Models

A Survey on Evaluating Large Language Models in Code Generation Tasks

Imperfect Code Generation: Uncovering Weaknesses in Automatic Code Generation by Large Language Models

Bugs in Large Language Models Generated Code: An Empirical Study

Evaluating Large Language Models in Class-Level Code Generation

Understanding Defects in Generated Codes by Language Models

Where Are Large Language Models for Code Generation on GitHub?

CodeJudge: Evaluating Code Generation with Large Language Models

Examination of Code generated by Large Language Models

The First Prompt Counts the Most! An Evaluation of Large Language Models on Iterative Example-based Code Generation

Large Language Models of Code Fail at Completing Code with Potential Bugs

Large Language Models for Code Analysis: Do LLMs Really Do Their Job?

Can Large Language Models Generate Geospatial Code?

Helping LLMs Improve Code Generation Using Feedback from Testing and Static Analysis

Can LLM Replace Stack Overflow? A Study on Robustness and Reliability of Large Language Model Code Generation

A Survey on Large Language Models for Code Generation