Towards Generating Functionally Correct Code Edits from Natural Language Issue Descriptions

Sarah Fakhoury, Saikat Chakraborty, Madan Musuvathi, Shuvendu K. Lahiri
2023-04-08
Abstract: Large language models (LLMs), such as OpenAI's Codex, have demonstrated their potential to generate code from natural language descriptions across a wide range of programming tasks. Several benchmarks have recently emerged to evaluate the ability of LLMs to generate functionally correct code from natural language intent with respect to a set of hidden test cases. This has enabled the research community to identify significant and reproducible advancements in LLM capabilities. However, there is currently a lack of benchmark datasets for assessing the ability of LLMs to generate functionally correct code edits based on natural language descriptions of intended changes. This paper aims to address this gap by motivating the problem NL2Fix of translating natural language descriptions of code changes (namely bug fixes described in Issue reports in repositories) into correct code fixes. To this end, we introduce Defects4J-NL2Fix, a dataset of 283 Java programs from the popular Defects4J dataset augmented with high-level descriptions of bug fixes, and empirically evaluate the performance of several state-of-the-art LLMs for this task. Results show that these LLMs together are capable of generating plausible fixes for 64.6% of the bugs, and the best LLM-based technique can achieve up to 21.20% top-1 and 35.68% top-5 accuracy on this benchmark.
Software Engineering, Machine Learning
What problem does this paper attempt to address?
This paper addresses the lack of benchmark datasets for evaluating whether large language models (LLMs) can generate functionally correct code edits from natural-language descriptions of intended changes. Specifically, it focuses on translating code changes described in natural language (namely bug fixes described in issue reports in software repositories) into correct code fixes. The authors call this problem NL2Fix and, to study it, construct a dataset named Defects4J-NL2Fix. The dataset contains 283 Java programs from Defects4J together with high-level descriptions of their bug fixes, as well as test suites used to check that a fix is functionally correct. Using this dataset, the authors empirically evaluate several state-of-the-art LLMs on the NL2Fix task. The results show that these LLMs together generate plausible fixes for 64.6% of the bugs, and that the best LLM-based technique achieves up to 21.20% top-1 and 35.68% top-5 accuracy on this benchmark.
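To make the top-1 and top-5 figures concrete, here is a minimal Python sketch of one common way to compute top-k accuracy. It assumes that each bug has a ranked list of candidate fixes and that each candidate has already been marked as passing or failing the bug's test suite. The function name `top_k_accuracy`, the bug identifiers, and the toy data are hypothetical, and the paper's exact evaluation protocol may differ.

```python
from typing import Dict, List


def top_k_accuracy(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of bugs with at least one test-passing fix in the first k candidates.

    `results` maps a bug identifier to its ranked candidates, where each entry
    is True if that candidate passes the bug's tests and False otherwise.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not results:
        return 0.0
    solved = sum(1 for ranked in results.values() if any(ranked[:k]))
    return solved / len(results)


# Toy data (hypothetical, not taken from the paper): four bugs, three ranked candidates each.
toy_results = {
    "bug-1": [True, False, False],   # solved by the first candidate
    "bug-2": [False, False, True],   # solved only by the third candidate
    "bug-3": [False, False, False],  # no candidate passes the tests
    "bug-4": [False, True, False],   # solved by the second candidate
}

print(top_k_accuracy(toy_results, 1))  # 0.25
print(top_k_accuracy(toy_results, 2))  # 0.5
print(top_k_accuracy(toy_results, 3))  # 0.75
```

Under this reading, top-k accuracy can only stay the same or grow as k increases, which is consistent with the reported top-5 figure being higher than the top-1 figure.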