LLMs are Imperfect, Then What? An Empirical Study on LLM Failures in Software Engineering

Jiessie Tie, Bingsheng Yao, Tianshi Li, Syed Ishtiaque Ahmed, Dakuo Wang, Shurui Zhou
2024-11-15
Abstract: Software engineers are integrating AI assistants into their workflows to enhance productivity and reduce cognitive strain. However, experiences vary significantly, with some engineers finding large language models (LLMs), like ChatGPT, beneficial, while others consider them counterproductive. Researchers also found that ChatGPT's answers included incorrect information. Given the fact that LLMs are still imperfect, it is important to understand how to best incorporate LLMs into the workflow for software engineering (SE) task completion. Therefore, we conducted an observational study with 22 participants using ChatGPT as a coding assistant in a non-trivial SE task to understand the practices, challenges, and opportunities for using LLMs for SE tasks. We identified the cases where ChatGPT failed, their root causes, and the corresponding mitigation solutions used by users. These findings contribute to the overall understanding and strategies for human-AI interaction on SE tasks. Our study also highlights future research and tooling support directions.
Software Engineering
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper explores the applications and limitations of large language models (LLMs) in software engineering tasks. Specifically, the researchers conduct an empirical study to understand the following points:

1. **Failure cases of LLMs in software engineering tasks**:
   - By observing the interactions between users and LLMs, the researchers aim to identify the situations in which LLMs (such as ChatGPT) cannot provide correct answers, and to analyze the root causes of these failures.
2. **How users respond to the failures of LLMs**:
   - The researchers aim to understand what mitigation measures users take when they encounter incorrect or incomplete answers from LLMs, and how effective these measures are.
3. **Users' views on imperfect LLMs**:
   - The researchers also explore users' perceptions and attitudes when using LLMs to complete software engineering tasks, especially how they view the limitations and shortcomings of LLMs.

### Research background

With the rise of artificial intelligence assistants, software engineers are increasingly integrating LLMs (such as ChatGPT) into their work processes to improve productivity and reduce cognitive burden. However, the experiences of different engineers vary significantly. Some engineers find LLMs very useful, while others find that they actually reduce work efficiency. In addition, researchers have found that ChatGPT sometimes provides incorrect information, which further highlights the limitations of LLMs.

### Research methods

To answer the above research questions, the researchers designed an observational study and invited 22 participants with basic programming experience to use ChatGPT as a coding assistant to complete a complex software engineering task. The specific steps are as follows:

1. **Task design**:
   - A web application development task with multiple subtasks was designed. The task difficulty gradually increased, from basic HTML to CSS styles, JavaScript dynamic effects, and finally to website deployment.
2. **Recruitment of participants**:
   - Through university mailing lists and convenience sampling, 25 post-secondary students were recruited. These students had programming experience but lacked web application development experience.
3. **Research protocol**:
   - Participants took part in virtual sessions via Zoom, using Google Chrome to display the target web page and ChatGPT, and VSCode to display the code base and update it in real time. The researchers recorded audio and screen activity. While completing the task, participants were required to "think aloud" so that the researchers could better understand their thinking processes.
4. **Data analysis**:
   - An open coding method was used to label the data and identify failure cases, root causes, and mitigation strategies. Through the analysis of 46 failure cases, the researchers summarized nine main failure types and 12 root causes.

### Main findings

1. **Failure types**:
   - The researchers identified nine main failure types: incomplete answers, verbose answers, lack of pre-conditions, being too complex, lack of context, irrelevant answers, wrong answers, inaccurate image generation, and no response to prompt changes.
2. **Failure causes**:
   - The causes of failure can be attributed to user-related reasons and ChatGPT-related reasons. User-related reasons include lack of details in questions, overly complex tasks, and failure to read answers carefully. ChatGPT-related reasons include generating incomplete answers, being verbose and lacking explanations, lack of pre-conditions, being too complex, lack of context, misinterpreting user intentions, generating wrong code, and inaccurate image generation.
3. **Mitigation strategies**:
   - Participants adopted a variety of mitigation strategies, including updating prompts, decomposing tasks, and consulting external resources. These strategies helped users overcome the limitations of LLMs to a certain extent.

### Conclusion

Through empirical analysis, this study reveals the limitations of LLMs in software engineering tasks and the behavior patterns of users in dealing with these limitations. The results help in understanding the performance and limitations of LLMs, and also provide a reference for the design of future artificial intelligence systems.
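
The data analysis step described above (labeling each observed failure case with a failure type, a cause, and a mitigation, then counting) can be illustrated with a minimal sketch. This is not the authors' analysis pipeline: the `CodedCase` record, the `tally` and `mitigations_by_failure` helpers, the participant IDs, and all four records are hypothetical and invented only to show the shape of such a tally. Only the label names are taken from the summary above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CodedCase:
    """One failure case after open coding (hypothetical schema)."""
    participant: str   # hypothetical participant ID
    failure_type: str  # e.g. "incomplete answer"
    cause_side: str    # "user" or "chatgpt"
    mitigation: str    # e.g. "update prompt"


# Toy records invented for illustration; these are NOT data from the study.
CASES = [
    CodedCase("P01", "incomplete answer", "chatgpt", "update prompt"),
    CodedCase("P01", "wrong answer", "chatgpt", "consult external resource"),
    CodedCase("P02", "lack of context", "user", "update prompt"),
    CodedCase("P03", "incomplete answer", "user", "decompose task"),
]


def tally(cases, field):
    """Count how often each label appears in the given field."""
    return Counter(getattr(case, field) for case in cases)


def mitigations_by_failure(cases):
    """Map each failure type to a Counter of the mitigations used for it."""
    grouped = {}
    for case in cases:
        grouped.setdefault(case.failure_type, Counter())[case.mitigation] += 1
    return grouped


if __name__ == "__main__":
    print(tally(CASES, "failure_type").most_common())
    print(tally(CASES, "cause_side").most_common())
    for failure, counts in mitigations_by_failure(CASES).items():
        print(failure, dict(counts))
```

Keeping each coded case as a flat record makes it easy to cross-tabulate any two labels, for example which mitigations users reached for after a given failure type.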