CREF: An LLM-based Conversational Software Repair Framework for Programming Tutors

Boyang Yang,Haoye Tian,Weiguo Pian,Haoran Yu,Haitao Wang,Jacques Klein,Tegawendé F. Bissyandé,Shunfu Jin
2024-07-08
Abstract:Program repair techniques offer cost-saving benefits for debugging within software development and programming education scenarios. With the proven effectiveness of Large Language Models (LLMs) in code-related tasks, researchers have explored their potential for program repair. However, it is crucial to recognize that existing repair benchmarks may have influenced LLM training data, potentially causing data leakage. To evaluate LLMs' realistic repair capabilities, (1) we introduce an extensive, non-crawled benchmark, referred to as TutorCode, comprising 1,239 C++ defect codes and associated information such as tutor guidance, solution description, failing test cases, and the corrected code. Our work assesses the repair performance of 12 LLMs on TutorCode, measuring repair correctness (TOP-5 and AVG-5) and patch precision (RPSR). (2) We then provide a comprehensive investigation into which types of extra information can help LLMs improve their performance in repairing defects. Among these types, tutor guidance was found to be the most effective information in enhancing LLM repair capabilities. To fully harness LLMs' conversational capabilities and the benefits of augmented information, (3) we introduce a novel conversational semi-automatic repair framework CREF assisting human tutor. It demonstrates a remarkable AVG-5 improvement of 17.2%-24.6% compared to the baseline, achieving an impressive AVG-5 of 76.6% when utilizing GPT-4. These results highlight the potential for enhancing LLMs' repair capabilities through interactions with tutors and historical conversations involving incorrect responses. The successful application of CREF in a real-world educational setting demonstrates its effectiveness in reducing tutors' workload and improving students' learning experience, while also showcasing its promise for facilitating other software engineering tasks, such as code review.
Software Engineering
What problem does this paper attempt to address?
The paper attempts to address the following issues: 1. **Data Leakage Issue**: Existing program repair benchmarks may have already been included in the training data of large language models (LLMs), leading to data leakage. This makes it difficult to evaluate the true repair capabilities of LLMs. Therefore, the paper proposes a new, un-crawled benchmark dataset "TutorCode" to ensure fairness and accuracy in evaluation. 2. **Role of Augmented Information**: Investigates how different types of augmented information (such as tutor guidance, solution descriptions, failed test cases, etc.) can help improve the performance of LLMs in program repair. The study finds that providing tutor guidance significantly enhances the repair performance of LLMs, and combining solution descriptions with failed test cases can further improve the results. 3. **Interactive Repair**: Utilizes the conversational abilities of LLMs to enhance their program repair capabilities through interaction with human tutors. To this end, the paper introduces a novel semi-automatic conversational repair framework called "Cref," which leverages multiple rounds of dialogue to fully exploit the repair potential of LLMs. Experimental results show that Cref significantly reduces debugging time and costs in practical applications, improving students' programming learning experience. ### Specific Issues and Methods - **RQ-1**: How effective are state-of-the-art LLMs in repairing faulty code? - Evaluates the repair performance of 12 well-known LLMs using TutorCode (an un-crawled dataset), including repair correctness and patch precision. Additionally, the study explores the impact of code length and programming task difficulty on the repair capabilities of LLMs. - **RQ-2**: Can augmented information enhance the repair capabilities of LLMs? - Uses three types of augmented information (solution descriptions, tutor guidance, failed test cases) to improve the repair capabilities of LLMs and analyzes the impact of different combinations of this information on repair performance. - **RQ-3**: To what extent can dialogue-based repair methods further exploit the repair potential of LLMs? - Introduces the Cref framework, which utilizes multiple rounds of dialogue to leverage the repair capabilities of LLMs. Experiments validate the impact of including or excluding historical dialogue records in each round on repair performance. ### Main Contributions - **TutorCode Benchmark Dataset**: Provides a large-scale, un-crawled C++ repair benchmark dataset containing 1,239 faulty code samples and related information. - **Evaluation of LLMs' Repair Capabilities**: Assesses the repair performance of 12 well-known LLMs on TutorCode. - **Impact of Augmented Information**: Analyzes the enhancement effect of different types of augmented information on the repair performance of LLMs. - **Cref Framework**: Introduces a dialogue-based semi-automatic repair framework that improves the repair capabilities of LLMs through multiple rounds of dialogue. - **Practical Application**: Deploys Cref in a company's programming education scenario, significantly reducing debugging time and costs, and improving students' programming learning experience. Through these studies, the paper not only addresses the data leakage issue in existing evaluation methods but also explores how to improve the performance of LLMs in program repair through augmented information and interactive methods.