Where Do Large Language Models Fail When Generating Code?

Zhijie Wang, Zijie Zhou, Da Song, Yuheng Huang, Shengmai Chen, Lei Ma, Tianyi Zhang
2024-08-07
Abstract: Large Language Models (LLMs) have shown great potential in code generation. However, current LLMs still cannot reliably generate correct code. Moreover, it is unclear what kinds of code generation errors LLMs make. To address this, we conducted an empirical study to analyze incorrect code snippets generated by six popular LLMs on the HumanEval dataset. We analyzed these errors along two dimensions of error characteristics -- semantic characteristics and syntactic characteristics -- to derive a comprehensive code generation error taxonomy for LLMs through open coding and thematic analysis. We then labeled all 557 incorrect code snippets based on this taxonomy. Our results showed that the six LLMs exhibited similar distributions of syntactic characteristics but different distributions of semantic characteristics. Furthermore, we analyzed the correlation between different error characteristics and factors such as task complexity, code length, and test-pass rate. Finally, we highlight the challenges that LLMs may encounter when generating code and propose implications for future research on reliable code generation with LLMs.
Software Engineering
What problem does this paper attempt to address?
### The Problems the Paper Attempts to Solve

This paper explores the failure cases of large language models (LLMs) in code generation. Despite their significant potential, LLMs still fail to reliably produce correct code, and it is unclear what types of errors they make when generating it. To fill this knowledge gap, the authors conducted an empirical study of the incorrect code snippets generated by six popular LLMs on the HumanEval dataset. Specifically, the paper attempts to answer the following questions:

1. **What types of code errors are generated by different LLMs?** This question aims to reveal the common characteristics and differences in the code errors produced by different LLMs, helping researchers understand whether general methods can be developed to improve LLMs or whether individual models require specialized handling.
2. **How much effort is needed to fix code generation errors?** In practice, it is unrealistic to expect LLMs to generate completely correct code in all cases. Existing research suggests that some incorrect code can still serve as a useful starting point for developers. It is therefore important to understand the effort required to fix incorrect solutions and the feasibility of automated fixes.
3. **How does task complexity affect LLMs' code generation?** Intuitively, complex tasks are harder to solve than simple ones. However, it is unclear whether different LLMs differ in their code generation capabilities across tasks of different complexity. Understanding the upper limit of task complexity that LLMs can handle gracefully can help estimate the effort needed for code review, testing, and fixing.
4. **What is the relationship between code length and the types of errors generated by LLMs?** Unlike the third question, this question investigates whether the length of the generated code affects the correctness of the code snippets. It focuses in particular on longer code solutions and their error characteristics.
5. **Does partially failing code have different characteristics compared to completely failing code?** This question explores the differences between code that fails some test cases and code that fails all test cases, providing insights into the specific challenges of achieving complete correctness.

By answering these questions, the paper hopes to provide insights and suggestions for future research on LLM code generation.
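The distinction between partially failing and completely failing code can be made concrete with a test-pass rate. The following is a minimal sketch under stated assumptions, not the authors' evaluation harness: the helper names `pass_rate` and `failure_category`, the toy candidates, and the test cases are all hypothetical and chosen only for illustration.

```python
from typing import Any, Callable, Sequence, Tuple

# Each test case pairs positional arguments with the expected return value.
# This representation is an assumption for illustration only.
TestCase = Tuple[Tuple[Any, ...], Any]


def pass_rate(candidate: Callable[..., Any], cases: Sequence[TestCase]) -> float:
    """Return the fraction of test cases the candidate passes.

    A raised exception counts as a failed test case.
    """
    if not cases:
        raise ValueError("at least one test case is required")
    passed = 0
    for args, expected in cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash is treated as a failure, not propagated
    return passed / len(cases)


def failure_category(rate: float) -> str:
    """Map a test-pass rate to the categories used in the question above."""
    if rate == 1.0:
        return "correct"
    if rate == 0.0:
        return "completely failing"
    return "partially failing"


# Hypothetical task: return the absolute value of an integer.
CASES: Sequence[TestCase] = [((3,), 3), ((0,), 0), ((-2,), 2), ((-5,), 5)]


def buggy_abs(x: int) -> int:
    # Hypothetical incorrect solution: forgets to negate negative inputs.
    return x


def always_crash(x: int) -> int:
    # Hypothetical incorrect solution that fails on every input.
    raise RuntimeError("not implemented")


if __name__ == "__main__":
    for fn in (abs, buggy_abs, always_crash):
        rate = pass_rate(fn, CASES)
        print(fn.__name__, rate, failure_category(rate))
```

Here `buggy_abs` handles non-negative inputs but not negative ones, so it lands in the partially failing category, while `always_crash` fails every case. Treating exceptions as failures is a design choice of this sketch; a real harness would typically also enforce timeouts and sandboxing.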