Abstract:Program repair techniques offer cost-saving benefits for debugging within software development and programming education scenarios. With the proven effectiveness of Large Language Models (LLMs) in code-related tasks, researchers have explored their potential for program repair. However, it is crucial to recognize that existing repair benchmarks may have influenced LLM training data, potentially causing data leakage. To evaluate LLMs' realistic repair capabilities, (1) we introduce an extensive, non-crawled benchmark, referred to as TutorCode, comprising 1,239 C++ defect codes and associated information such as tutor guidance, solution description, failing test cases, and the corrected code. Our work assesses the repair performance of 12 LLMs on TutorCode, measuring repair correctness (TOP-5 and AVG-5) and patch precision (RPSR). (2) We then provide a comprehensive investigation into which types of extra information can help LLMs improve their performance in repairing defects. Among these types, tutor guidance was found to be the most effective information in enhancing LLM repair capabilities. To fully harness LLMs' conversational capabilities and the benefits of augmented information, (3) we introduce a novel conversational semi-automatic repair framework CREF assisting human tutor. It demonstrates a remarkable AVG-5 improvement of 17.2%-24.6% compared to the baseline, achieving an impressive AVG-5 of 76.6% when utilizing GPT-4. These results highlight the potential for enhancing LLMs' repair capabilities through interactions with tutors and historical conversations involving incorrect responses. The successful application of CREF in a real-world educational setting demonstrates its effectiveness in reducing tutors' workload and improving students' learning experience, while also showcasing its promise for facilitating other software engineering tasks, such as code review.

What problem does this paper attempt to address?

The paper attempts to address the following issues: 1. **Data Leakage Issue**: Existing program repair benchmarks may have already been included in the training data of large language models (LLMs), leading to data leakage. This makes it difficult to evaluate the true repair capabilities of LLMs. Therefore, the paper proposes a new, un-crawled benchmark dataset "TutorCode" to ensure fairness and accuracy in evaluation. 2. **Role of Augmented Information**: Investigates how different types of augmented information (such as tutor guidance, solution descriptions, failed test cases, etc.) can help improve the performance of LLMs in program repair. The study finds that providing tutor guidance significantly enhances the repair performance of LLMs, and combining solution descriptions with failed test cases can further improve the results. 3. **Interactive Repair**: Utilizes the conversational abilities of LLMs to enhance their program repair capabilities through interaction with human tutors. To this end, the paper introduces a novel semi-automatic conversational repair framework called "Cref," which leverages multiple rounds of dialogue to fully exploit the repair potential of LLMs. Experimental results show that Cref significantly reduces debugging time and costs in practical applications, improving students' programming learning experience. ### Specific Issues and Methods - **RQ-1**: How effective are state-of-the-art LLMs in repairing faulty code? - Evaluates the repair performance of 12 well-known LLMs using TutorCode (an un-crawled dataset), including repair correctness and patch precision. Additionally, the study explores the impact of code length and programming task difficulty on the repair capabilities of LLMs. - **RQ-2**: Can augmented information enhance the repair capabilities of LLMs? - Uses three types of augmented information (solution descriptions, tutor guidance, failed test cases) to improve the repair capabilities of LLMs and analyzes the impact of different combinations of this information on repair performance. - **RQ-3**: To what extent can dialogue-based repair methods further exploit the repair potential of LLMs? - Introduces the Cref framework, which utilizes multiple rounds of dialogue to leverage the repair capabilities of LLMs. Experiments validate the impact of including or excluding historical dialogue records in each round on repair performance. ### Main Contributions - **TutorCode Benchmark Dataset**: Provides a large-scale, un-crawled C++ repair benchmark dataset containing 1,239 faulty code samples and related information. - **Evaluation of LLMs' Repair Capabilities**: Assesses the repair performance of 12 well-known LLMs on TutorCode. - **Impact of Augmented Information**: Analyzes the enhancement effect of different types of augmented information on the repair performance of LLMs. - **Cref Framework**: Introduces a dialogue-based semi-automatic repair framework that improves the repair capabilities of LLMs through multiple rounds of dialogue. - **Practical Application**: Deploys Cref in a company's programming education scenario, significantly reducing debugging time and costs, and improving students' programming learning experience. Through these studies, the paper not only addresses the data leakage issue in existing evaluation methods but also explores how to improve the performance of LLMs in program repair through augmented information and interactive methods.

CREF: An LLM-based Conversational Software Repair Framework for Programming Tutors

ContrastRepair: Enhancing Conversation-Based Automated Program Repair via Contrastive Test Case Pairs

FastFixer: An Efficient and Effective Approach for Repairing Programming Assignments

Enhancing Code Language Models for Program Repair by Curricular Fine-tuning Framework

The Right Prompts for the Job: Repair Code-Review Defects with Large Language Model

Revisiting Evolutionary Program Repair via Code Language Model

RePair: Automated Program Repair with Process-based Feedback

Conversational Automated Program Repair

ThinkRepair: Self-Directed Automated Program Repair

A Novel Approach for Automatic Program Repair using Round-Trip Translation with Large Language Models

Peer-aided Repairer: Empowering Large Language Models to Repair Advanced Student Assignments

Enhancing the Code Debugging Ability of LLMs via Communicative Agent Based Data Refinement

Benchmarking Educational Program Repair

Multi-Objective Fine-Tuning for Enhanced Program Repair with LLMs

Exploring and Lifting the Robustness of LLM-powered Automated Program Repair with Metamorphic Testing

How Far Can We Go with Practical Function-Level Program Repair?

Keep the Conversation Going: Fixing 162 out of 337 bugs for $0.42 each using ChatGPT

Repairing Bugs in Python Assignments Using Large Language Models

Exploring the Potential of Pre-Trained Language Models of Code for Automated Program Repair

Automated C/C++ Program Repair for High-Level Synthesis via Large Language Models

RepairLLaMA: Efficient Representations and Fine-Tuned Adapters for Program Repair