Abstract: Software engineers are integrating AI assistants into their workflows to enhance productivity and reduce cognitive strain. However, experiences vary significantly, with some engineers finding large language models (LLMs), like ChatGPT, beneficial, while others consider them counterproductive. Researchers also found that ChatGPT's answers included incorrect information. Given that LLMs are still imperfect, it is important to understand how to best incorporate LLMs into the workflow for software engineering (SE) task completion. Therefore, we conducted an observational study with 22 participants using ChatGPT as a coding assistant in a non-trivial SE task to understand the practices, challenges, and opportunities for using LLMs for SE tasks. We identified the cases where ChatGPT failed, their root causes, and the corresponding mitigation solutions used by users. These findings contribute to the overall understanding and strategies for human-AI interaction on SE tasks. Our study also highlights future research and tooling support directions.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper explores the applications and limitations of large language models (LLMs) in software engineering tasks. Specifically, the researchers conduct an empirical study to understand the following points:

1. **Failure cases of LLMs in software engineering tasks**:
   - By observing the interactions between users and LLMs, the researchers aim to identify the situations in which LLMs (such as ChatGPT) fail to provide correct answers, and to analyze the root causes of these failures.
2. **How users respond to the failures of LLMs**:
   - The researchers aim to understand what mitigation measures users take when they encounter incorrect or incomplete answers from LLMs, and how effective these measures are.
3. **Users' views on imperfect LLMs**:
   - The researchers also explore users' perceptions and attitudes when using LLMs to complete software engineering tasks, especially how they view the limitations and shortcomings of LLMs.

### Research background

With the rise of AI assistants, software engineers are increasingly integrating LLMs (such as ChatGPT) into their workflows to improve productivity and reduce cognitive burden. However, the experiences of different engineers vary significantly. Some engineers find LLMs very useful, while others find that they reduce work efficiency. In addition, researchers have found that ChatGPT sometimes provides incorrect information, which further highlights the limitations of LLMs.

### Research methods

To answer the above research questions, the researchers designed an observational study and invited 22 participants with basic programming experience to use ChatGPT as a coding assistant to complete a complex software engineering task. The specific steps are as follows:

1. **Task design**:
   - A web application development task with multiple subtasks was designed. The task difficulty gradually increased, from basic HTML to CSS styles, JavaScript dynamic effects, and finally to website deployment.
2. **Recruitment of participants**:
   - Through university mailing lists and convenience sampling, 25 post-secondary students were recruited. These students had programming experience but lacked web application development experience.
3. **Research protocol**:
   - Participants took part in virtual sessions via Zoom, using Google Chrome to display the target web page and ChatGPT, and VSCode to display the code base and update it in real time. The researchers recorded audio and screen activity. While completing the task, participants were required to "think aloud" so that the researchers could better understand their thinking processes.
4. **Data analysis**:
   - An open coding method was used to label the data and identify failure cases, root causes, and mitigation strategies. Through the analysis of 46 failure cases, the researchers summarized nine main failure types and 12 root causes.

### Main findings

1. **Failure types**:
   - The researchers identified nine main failure types: incomplete answers, verbose answers, lack of pre-conditions, being too complex, lack of context, irrelevant answers, wrong answers, inaccurate image generation, and no response to prompt changes.
2. **Failure causes**:
   - The causes of failure can be attributed to user-related reasons and ChatGPT-related reasons. User-related reasons include lack of detail in questions, overly complex tasks, and failure to read answers carefully. ChatGPT-related reasons include generating incomplete answers, being verbose and lacking explanations, lack of pre-conditions, being too complex, lack of context, misinterpreting user intentions, generating wrong code, and inaccurate image generation.
3. **Mitigation strategies**:
   - Participants adopted a variety of mitigation strategies, including updating prompts, decomposing tasks, and consulting external resources. These strategies helped users overcome the limitations of LLMs to a certain extent.

### Conclusion

Through empirical analysis, this study reveals the limitations of LLMs in software engineering tasks and the behavior patterns of users in dealing with these limitations. The results help in understanding the performance and limitations of LLMs, and they also provide a reference for the design of future AI systems.

LLMs are Imperfect, Then What? An Empirical Study on LLM Failures in Software Engineering
