Abstract: This paper introduces Patched Round-Trip Correctness (Patched RTC), a novel evaluation technique for Large Language Models (LLMs) applied to diverse software development tasks, particularly focusing on "outer loop" activities such as bug fixing, code review, and documentation updates. Patched RTC extends the original Round-Trip Correctness method to work with any LLM and downstream task, offering a self-evaluating framework that measures consistency and robustness of model responses without human intervention. The study demonstrates a correlation between Patched RTC scores and task-specific accuracy metrics, presenting it as an alternative to the LLM-as-Judge paradigm for open-domain task evaluation. We implement Patched RTC in an open-source framework called patchwork, allowing for transparent evaluation during inference across various patchflows. Experiments comparing GPT-3.5 and GPT-4 models across different software development tasks reveal that Patched RTC effectively distinguishes model performance and task difficulty. The paper also explores the impact of consistency prompts on improving model accuracy, suggesting that Patched RTC can guide prompt refinement and model selection for complex software development workflows.

What problem does this paper attempt to address?

The problem this paper addresses is how to effectively evaluate the performance of large language models (LLMs) on outer-loop tasks in software development, especially complex tasks involving code repair, code review, and documentation updates. Existing evaluation methods mainly focus on inner-loop tasks, such as code generation, summarization, and unit testing, and pay less attention to outer-loop tasks.

### Specific problems include:

1. **Lack of effective evaluation methods**: Although existing benchmarks (such as HumanEval and MBPP) can be run automatically without human intervention, they cannot fully capture complex real-world scenarios, especially for outer-loop tasks.
2. **Need to reduce dependence on human annotation**: Traditional evaluation methods usually rely on human annotation or inspection, which is both time-consuming and costly. A method that can evaluate model performance without human intervention is therefore required.
3. **Need to improve the consistency and robustness of model responses**: Improving model performance on outer-loop tasks requires a method that measures the consistency and robustness of model responses and can guide prompt optimization and model selection.

### Solutions proposed in the paper:

The paper introduces **Patched Round-Trip Correctness (Patched RTC)**, a new evaluation technique aimed at solving the above problems. Patched RTC measures the consistency and robustness of model responses through a self-evaluation framework without human intervention. It works as follows:

- **Round-trip test**: Given a user query \(Q\), the model generates a response \(R\). Then, based on \(Q\) and \(R\), the model generates a new query \(Q_1\), and generates a new response \(R_1\) from \(Q_1\). Finally, a similarity score between \(R\) and \(R_1\) is calculated to determine whether the model response is correct.
- **Applicable to multiple tasks**: Patched RTC is not limited to code-related tasks; it can also be applied to a wide range of outer-loop development tasks, such as code repair, pull-request review, and documentation updates.
- **Correlation with existing benchmarks**: Experiments show that Patched RTC scores correlate highly with existing benchmarks (such as Arena-Hard-Auto), which supports its effectiveness.

In this way, Patched RTC provides an automated evaluation method that needs no human annotation and is especially suitable for complex outer-loop tasks, which helps accelerate the automation of these tasks in software development.
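The round-trip test described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patchwork implementation: the `model` callable, the wording of the regeneration prompt, the use of `difflib.SequenceMatcher` as the similarity measure, and the `threshold` default are all hypothetical choices made for this example.

```python
from difflib import SequenceMatcher
from typing import Callable, Tuple

# Assumption: a "model" is any callable that maps a prompt string to a reply string.
Model = Callable[[str], str]


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for whatever
    similarity measure a real evaluation would use."""
    return SequenceMatcher(None, a, b).ratio()


def patched_rtc(model: Model, query: str, threshold: float = 0.8) -> Tuple[float, bool]:
    """Run one round trip Q -> R -> Q1 -> R1 and compare R with R1.

    Returns the similarity score and whether it reaches `threshold`
    (the threshold value is an assumption of this sketch).
    """
    # Step 1: respond to the original query Q with R.
    response = model(query)

    # Step 2: ask the same model for a new query Q1 based on Q and R.
    # The prompt wording here is hypothetical.
    new_query = model(
        "Write a new query for which the response below would be a sufficient answer.\n"
        f"Original query: {query}\n"
        f"Response: {response}"
    )

    # Step 3: respond to Q1 with R1.
    new_response = model(new_query)

    # Step 4: score the agreement between R and R1.
    score = similarity(response, new_response)
    return score, score >= threshold
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM client, and it lets a deterministic stub stand in for the model when checking the control flow.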

Patched RTC: evaluating LLMs for diverse software development tasks
