Abstract: The automated program repair field has attracted substantial interest over the years, but despite significant research efforts, creating a system that works well for complex semantic bugs such as security vulnerabilities has proven difficult. A promising direction for addressing this challenge is to leverage large language models (LLMs), which are increasingly used to solve various programming tasks. In this paper, we investigate the effectiveness of LLMs for solving the code-repair task. We show that the task is difficult because it requires the model to learn long-range code relationships, which inherently relies on extensive amounts of training data. At the same time, creating a large, clean dataset of complex program bugs and their corresponding fixes is non-trivial. We propose a technique to address these challenges with a new approach for querying and fine-tuning LLMs. The idea is to use program analysis to focus the LLM's attention mechanism on the portions of code needed to perform the fix, drastically reducing the amount of required training data. Concretely, for training and inference, rather than feeding the entire program to the LLM, we reduce its code to a much shorter snippet that contains the reported defect together with the necessary context, and use that instead. Our evaluation shows that this code reduction approach substantially improves available models such as GPT-4 using few-shot learning, as well as fine-tuned models. To train and evaluate our system, we created a comprehensive code fixing dataset by extensively labeling 156 bug patterns (including 40 security rules), requiring complex interprocedural dataflow to discover. Our best system with Mixtral-8x7B can remove more than 80% of the reported defects while exactly matching the human fix in between 10% and 50% of cases, outperforming baselines based on GPT-3.5 and GPT-4, or based on window-based models like TFix.
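
To make the code reduction idea concrete, the following is a minimal sketch and not the analysis used in the paper, which relies on interprocedural dataflow. It assumes a much simpler stand-in: an intraprocedural, name-based backward slice over Python source. The function `reduce_snippet` and the `SAMPLE` program are hypothetical and exist only for illustration.

```python
import ast


def reduce_snippet(source: str, defect_line: int) -> str:
    """Toy code reduction (an assumption, not the paper's method): keep the
    statement at the reported line plus earlier statements of the same
    function that share a variable name with it, transitively."""
    tree = ast.parse(source)
    lines = source.splitlines()

    # Pick the innermost function whose span contains the reported line.
    funcs = [
        n for n in ast.walk(tree)
        if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))
        and n.lineno <= defect_line <= n.end_lineno
    ]
    if not funcs:
        return source  # defect is at module level: nothing to reduce here
    func = min(funcs, key=lambda n: n.end_lineno - n.lineno)

    def names(node: ast.AST) -> set:
        # All identifiers read or written anywhere inside the node.
        return {n.id for n in ast.walk(node) if isinstance(n, ast.Name)}

    body = func.body
    hit = next((i for i, s in enumerate(body)
                if s.lineno <= defect_line <= s.end_lineno), None)
    if hit is None:
        return source  # reported line is in the signature, not the body

    # Walk backwards from the defect, growing the set of relevant names.
    kept = [body[hit]]
    relevant = names(body[hit])
    for stmt in reversed(body[:hit]):
        stmt_names = names(stmt)
        if stmt_names & relevant:
            kept.append(stmt)
            relevant |= stmt_names

    # Emit the function header followed by the kept statements in order.
    out = list(lines[func.lineno - 1 : body[0].lineno - 1])
    for stmt in sorted(kept, key=lambda s: s.lineno):
        out.extend(lines[stmt.lineno - 1 : stmt.end_lineno])
    return "\n".join(out)


# Hypothetical program with a defect reported on line 7 (the execute call).
SAMPLE = '''def handler(request, db):
    user = request.args["user"]
    banner = "welcome"
    log = []
    query = "SELECT * FROM t WHERE name = '" + user + "'"
    log.append(banner)
    return db.execute(query)
'''

reduced = reduce_snippet(SAMPLE, defect_line=7)
print(reduced)
```

In this sketch the statements touching only `banner` and `log` are dropped, while the definitions of `user` and `query` are retained because the reported line depends on them. A name-based slice of this kind ignores control flow, aliasing, and calls into other functions, which is exactly the context that a real dataflow analysis would have to track.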

Investigating large language models capabilities for automatic code repair in Python

Repairing Bugs in Python Assignments Using Large Language Models

A Critical Review of Large Language Model on Software Engineering: An Example from ChatGPT and Automated Program Repair

The Future Can’t Help Fix the Past: Assessing Program Repair in the Wild

An Analysis of the Automatic Bug Fixing Performance of ChatGPT

DeepCode AI Fix: Fixing Security Vulnerabilities with Large Language Models

Exploring the Potential of Pre-Trained Language Models of Code for Automated Program Repair

A Novel Approach for Automatic Program Repair using Round-Trip Translation with Large Language Models

On Repairing Quantum Programs Using ChatGPT

Towards python program repair with generative pre-trained transformer (GPT-3.5)

When Large Language Models Confront Repository-Level Automatic Program Repair: How Well They Done?

Conversational Automated Program Repair

Practical Program Repair in the Era of Large Pre-trained Language Models

Keep the Conversation Going: Fixing 162 out of 337 bugs for $0.42 each using ChatGPT

The Right Prompts for the Job: Repair Code-Review Defects with Large Language Model

Revisiting Evolutionary Program Repair via Code Language Model

Peer-aided Repairer: Empowering Large Language Models to Repair Advanced Student Assignments

Debugging with Open-Source Large Language Models: An Evaluation

Extending the Frontier of ChatGPT: Code Generation and Debugging

RePair: Automated Program Repair with Process-based Feedback

ThinkRepair: Self-Directed Automated Program Repair