Towards Generating Functionally Correct Code Edits from Natural Language Issue Descriptions

Sarah Fakhoury, Saikat Chakraborty, Madan Musuvathi, Shuvendu K. Lahiri

2023-04-08

Abstract: Large language models (LLMs), such as OpenAI's Codex, have demonstrated their potential to generate code from natural language descriptions across a wide range of programming tasks. Several benchmarks have recently emerged to evaluate the ability of LLMs to generate functionally correct code from natural language intent with respect to a set of hidden test cases. This has enabled the research community to identify significant and reproducible advancements in LLM capabilities. However, there is currently a lack of benchmark datasets for assessing the ability of LLMs to generate functionally correct code edits based on natural language descriptions of intended changes. This paper aims to address this gap by motivating the problem NL2Fix of translating natural language descriptions of code changes (namely bug fixes described in issue reports in repositories) into correct code fixes. To this end, we introduce Defects4J-NL2Fix, a dataset of 283 Java programs from the popular Defects4J dataset augmented with high-level descriptions of bug fixes, and empirically evaluate the performance of several state-of-the-art LLMs for this task. Results show that these LLMs together are capable of generating plausible fixes for 64.6% of the bugs, and the best LLM-based technique can achieve up to 21.20% top-1 and 35.68% top-5 accuracy on this benchmark.

Software Engineering, Machine Learning

What problem does this paper attempt to address?

This paper addresses the lack of benchmark datasets for evaluating whether large language models (LLMs) can generate functionally correct code edits from natural language descriptions. Specifically, it focuses on turning code-change requests written in natural language (such as bug fixes described in issue reports in software repositories) into correct code fixes. The authors name this problem NL2Fix and construct a dataset for it, Defects4J-NL2Fix, which contains 283 Java programs with high-level descriptions of their bug fixes, along with test suites used to check whether a fix is functionally correct. Using this dataset, the authors empirically evaluate several state-of-the-art LLMs on the NL2Fix task. The results show that these LLMs together can generate plausible fixes for 64.6% of the bugs, and that the best LLM-based technique achieves up to 21.20% top-1 and 35.68% top-5 accuracy on this benchmark.
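
The top-1 and top-5 figures above can be read as test-based accuracy over ranked candidate fixes. The sketch below shows one way such a metric could be computed. It assumes that a bug counts as solved at top-k when at least one of its first k ranked candidates passes the test suite; the summary above does not spell out the paper's exact definition. The bug identifiers and pass/fail outcomes are made up for illustration and are not results from the paper.

```python
from typing import Dict, List


def top_k_accuracy(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of bugs with a test-passing fix among the first k candidates.

    `results` maps a bug identifier to the pass/fail outcome of each
    candidate fix, in ranked order (index 0 is the top-ranked candidate).
    Assumption: "solved at top-k" means any of the first k candidates passes.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not results:
        return 0.0
    solved = sum(1 for outcomes in results.values() if any(outcomes[:k]))
    return solved / len(results)


# Hypothetical outcomes for four bugs, three ranked candidates each.
results = {
    "bug-a": [True, False, False],   # solved by the top-ranked candidate
    "bug-b": [False, False, True],   # solved only by the third candidate
    "bug-c": [False, False, False],  # no candidate passes the tests
    "bug-d": [False, True, False],   # solved by the second candidate
}

for k in (1, 2, 3):
    print(f"top-{k} accuracy: {top_k_accuracy(results, k):.2%}")
```

Under this reading, top-k accuracy can only stay the same or rise as k grows, which matches the gap between the reported top-1 and top-5 numbers.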

DeepCode AI Fix: Fixing Security Vulnerabilities with Large Language Models

Fixing Code Generation Errors for Large Language Models

Hotfixing Large Language Models for Code

What's Wrong with Your Code Generated by Large Language Models? An Extensive Study

Large Language Models of Code Fail at Completing Code with Potential Bugs

Self-Edit: Fault-Aware Code Editor for Code Generation

Fixing Hardware Security Bugs with Large Language Models

An Exploratory Study on Using Large Language Models for Mutation Testing

On Hardware Security Bug Code Fixes By Prompting Large Language Models

Impact of Large Language Models of Code on Fault Localization

The GitHub Recent Bugs Dataset for Evaluating LLM-based Debugging Applications

Model Editing for LLMs4Code: How Far are We?

CompCodeVet: A Compiler-guided Validation and Enhancement Approach for Code Dataset

Understanding Defects in Generated Codes by Language Models

LLM-Assisted Code Cleaning For Training Accurate Code Generators

SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents

Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions

Large Language Models and Simple, Stupid Bugs

An Empirical Study on Learning Bug-Fixing Patches in the Wild via Neural Machine Translation