LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation

Sarah Fakhoury, Aaditya Naik, Georgios Sakkas, Saikat Chakraborty, Shuvendu K. Lahiri
DOI: https://doi.org/10.1109/TSE.2024.3428972
2024-10-03
Abstract: Large language models (LLMs) have shown great potential in automating significant aspects of coding by producing natural code from informal natural language (NL) intent. However, given NL is informal, it does not lend itself easily to checking that the generated code correctly satisfies the user intent. In this paper, we propose a novel interactive workflow TiCoder for guided intent clarification (i.e., partial formalization) through tests to support the generation of more accurate code suggestions. Through a mixed-methods user study with 15 programmers, we present an empirical evaluation of the effectiveness of the workflow to improve code generation accuracy. We find that participants using the proposed workflow are significantly more likely to correctly evaluate AI-generated code, and report significantly less task-induced cognitive load. Furthermore, we test the potential of the workflow at scale with four different state-of-the-art LLMs on two Python datasets, using an idealized proxy for user feedback. We observe an average absolute improvement of 45.97% in the pass@1 code generation accuracy for both datasets and across all LLMs within 5 user interactions, in addition to the automatic generation of accompanying unit tests.
Software Engineering
What problem does this paper attempt to address?
This paper addresses the difficulty of generating code from natural-language intent with large language models (LLMs). Because natural language is ambiguous and informal, it is hard to verify that generated code actually matches what the user meant. Although LLMs show great potential for producing natural-looking programs from natural-language descriptions, this ambiguity means the generated code may contain subtle errors that are inconsistent with the user's original intent. Moreover, because natural language is informal, user intent cannot be enforced directly through mechanical processes such as testing, static analysis, or formal verification.

To address these problems, the paper proposes a new interactive workflow, TiCoder (Test-Driven Interactive Code Generation), which supports the generation of more accurate code suggestions by clarifying (i.e., partially formalizing) the user intent through tests. TiCoder first clarifies the user intent through automatically generated tests, and then generates a list of code suggestions that are consistent with those tests. This makes the natural-language intent more precise and helps prune incorrect suggestions generated by LLMs. The resulting tests also aid in debugging the remaining suggestions and in regression testing future code edits.

The paper evaluates TiCoder through a mixed-methods user study and a large-scale empirical evaluation, aiming to answer two main research questions:
1. How does TiCoder affect the performance of Python developers in evaluating AI-generated code, especially in terms of task correctness, time, and cognitive load?
2. Can the TiCoder workflow improve the accuracy of generated code suggestions?
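The test-then-prune idea described above can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names (`passes`, `pick_test`, `prune`, `interactive_loop`), the test-selection heuristic, and the rule for handling a rejected test are all assumptions made for illustration. Candidates are plain functions standing in for LLM suggestions, tests are `(input, expected_output)` pairs, and the `oracle` callback plays the role of the user (or of an idealized proxy for user feedback, as in the large-scale evaluation).

```python
from typing import Callable, List, Tuple

Candidate = Callable[[int], int]   # stand-in for one LLM code suggestion
Test = Tuple[int, int]             # (input, expected_output)


def passes(candidate: Candidate, test: Test) -> bool:
    """Return True if the candidate produces the expected output for the test."""
    arg, expected = test
    try:
        return candidate(arg) == expected
    except Exception:
        return False  # a crashing candidate counts as failing the test


def pick_test(candidates: List[Candidate], tests: List[Test]) -> Test:
    """Hypothetical heuristic: prefer the test that splits the candidates most evenly."""
    def split_score(test: Test) -> int:
        n_pass = sum(passes(c, test) for c in candidates)
        return min(n_pass, len(candidates) - n_pass)
    return max(tests, key=split_score)  # ties resolve to the earliest test


def prune(candidates: List[Candidate], test: Test, approved: bool) -> List[Candidate]:
    """Assumed rule: keep candidates whose behavior agrees with the user's verdict."""
    return [c for c in candidates if passes(c, test) == approved]


def interactive_loop(
    candidates: List[Candidate],
    tests: List[Test],
    oracle: Callable[[Test], bool],
    max_interactions: int = 5,
) -> Tuple[List[Candidate], List[Test]]:
    """Ask the oracle about one test per round and prune candidates accordingly."""
    remaining = list(tests)
    approved_tests: List[Test] = []
    for _ in range(max_interactions):
        if not remaining or len(candidates) <= 1:
            break
        test = pick_test(candidates, remaining)
        remaining.remove(test)
        verdict = oracle(test)
        if verdict:
            approved_tests.append(test)  # approved tests double as unit tests
        candidates = prune(candidates, test, verdict)
    return candidates, approved_tests


# Toy intent: "return the absolute value of an integer".
def identity(x: int) -> int:
    return x

def negate(x: int) -> int:
    return -x

def absolute(x: int) -> int:
    return abs(x)

generated_tests = [(3, 3), (-3, 3), (-3, -3), (0, 0)]
# Idealized user proxy: approves a test only if it matches the true intent.
proxy_user = lambda test: abs(test[0]) == test[1]

survivors, approved = interactive_loop(
    [identity, negate, absolute], generated_tests, proxy_user
)
print([c.__name__ for c in survivors])  # ['absolute']
print(approved)                         # [(3, 3), (-3, 3)]
```

In this toy run, two interactions are enough to narrow three candidates down to one, and the approved tests are left over as a small regression suite. A real system would obtain both candidates and tests from an LLM and would need sandboxed execution, which this sketch omits.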