Abstract: Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are often insufficient for instilling self-correction behavior. In particular, we observe that training via SFT falls prey to either a distribution mismatch between mistakes made by the data-collection policy and the model's own responses, or to behavior collapse, where learning implicitly prefers only a certain mode of correction behavior that is often not effective at self-correction on test problems. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction behavior that is effective at test time as opposed to fitting high-reward responses for a given prompt. This regularization process includes an initial phase of multi-turn RL on a base model to generate a policy initialization that is less susceptible to collapse, followed by using a reward bonus to amplify self-correction. With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.

What problem does this paper attempt to address?

The paper addresses the limited self-correction ability of large language models (LLMs). Specifically, modern LLMs struggle to detect and correct their own erroneous responses without external input. Although these models often possess the knowledge required to solve a problem, they fail to use that knowledge to produce a correct answer.

### Main Objectives of the Paper:

1. **Enhance the self-correction ability of LLMs**: Develop a multi-turn online reinforcement learning method (SCoRe) that enables LLMs to detect and correct their own errors without external supervision.
2. **Address the limitations of existing methods**: Current methods for training self-correction often rely on multiple models, a more advanced model, or additional forms of supervision, and they face issues such as distribution mismatch and behavior collapse. SCoRe aims to overcome these limitations by training the model using only self-generated data.

### Background and Motivation:

- **Limitations of existing methods**: Methods such as supervised fine-tuning (SFT) and simple reinforcement learning (RL) suffer from distribution mismatch and behavior collapse when training for self-correction. Distribution mismatch is the gap between the errors in the training data and the errors the model itself makes, which leads to poor training outcomes. Behavior collapse is the model's tendency to produce its best response in the first attempt and make no substantial modifications in subsequent attempts.
- **Design of SCoRe**: SCoRe trains directly on the model's own distribution through multi-turn online reinforcement learning and uses appropriate regularization to guide the learning process, thus avoiding behavior collapse. Specifically, SCoRe includes two stages:
  - **First Stage**: Generate an initial policy that produces high-reward responses in the second attempt while mimicking the base model's response in the first attempt.
  - **Second Stage**: Jointly optimize both attempts, using shaped rewards to incentivize the discovery of self-correction strategies rather than simply generating the best first response and making minor modifications in the second attempt.

### Experimental Results:

- **Performance Improvement**: On the MATH and HumanEval datasets, SCoRe improved the base models' self-correction by 15.6% and 9.1%, respectively.
- **Comparative Experiments**: Compared to existing SFT and standard RL methods, SCoRe performed better at self-correction, especially when dealing with new problems.

### Conclusion:

SCoRe is an effective multi-turn reinforcement learning method that significantly enhances the self-correction ability of LLMs, enabling them to better detect and correct their own errors without external supervision. This method is expected to play an important role in fields such as scientific computation, mathematical reasoning, and code writing.
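The second-stage reward shaping and the self-correction improvement reported in the summary can be illustrated with a small Python sketch. This is not the authors' implementation: the summary only states that a reward bonus amplifies self-correction, so the specific bonus form used here (a multiplier `alpha` times the change in correctness between attempts), the value of `alpha`, and all names (`TwoAttemptRollout`, `shaped_rewards`, `correction_delta`) are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TwoAttemptRollout:
    """Correctness of a model's first and second attempt on one problem (hypothetical type)."""
    first_correct: bool
    second_correct: bool


def shaped_rewards(rollout: TwoAttemptRollout, alpha: float = 2.0) -> Tuple[float, float]:
    """Return (first-attempt reward, shaped second-attempt reward).

    Assumed bonus form: alpha * (r2 - r1). It pays extra for turning a wrong
    answer into a right one, penalizes breaking a right answer, and is zero
    when the second attempt leaves correctness unchanged.
    """
    r1 = float(rollout.first_correct)
    r2 = float(rollout.second_correct)
    bonus = alpha * (r2 - r1)
    return r1, r2 + bonus


def correction_delta(rollouts: List[TwoAttemptRollout]) -> float:
    """Second-attempt accuracy minus first-attempt accuracy over a batch."""
    n = len(rollouts)
    acc_first = sum(r.first_correct for r in rollouts) / n
    acc_second = sum(r.second_correct for r in rollouts) / n
    return acc_second - acc_first


if __name__ == "__main__":
    batch = [
        TwoAttemptRollout(False, True),   # successful correction
        TwoAttemptRollout(True, True),    # correct answer left unchanged
        TwoAttemptRollout(False, False),  # failed correction
        TwoAttemptRollout(False, True),   # successful correction
    ]
    for rollout in batch:
        print(rollout, shaped_rewards(rollout))
    print("correction delta:", correction_delta(batch))
```

Under this assumed shaping, a rollout that merely repeats a correct first answer earns no bonus, so a policy cannot maximize reward by producing its best first response and leaving it untouched, which is the collapse mode the second stage is meant to avoid.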

Training Language Models to Self-Correct via Reinforcement Learning

Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies

Self-Correction is More than Refinement: A Learning Framework for Visual and Language Reasoning Tasks

When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs

A Theoretical Understanding of Self-Correction through In-context Alignment

CorrectionLM: Self-Corrections with SLM for Dialogue State Tracking

Confidence Matters: Revisiting Intrinsic Self-Correction Capabilities of Large Language Models

Is Moral Self-correction An Innate Capability of Large Language Models? A Mechanistic Analysis to Self-correction

Small Language Models Need Strong Verifiers to Self-Correct Reasoning

Confidence v.s. Critique: A Decomposition of Self-Correction Capability for LLMs

Large Language Models have Intrinsic Self-Correction Ability

Automatically Correcting Large Language Models: Surveying the Landscape of Diverse Automated Correction Strategies

ProgCo: Program Helps Self-Correction of Large Language Models

Learning to Check: Unleashing Potentials for Self-Correction in Large Language Models

Large Language Models Cannot Self-Correct Reasoning Yet

Recursive Introspection: Teaching Language Model Agents How to Self-Improve

S^3c-Math: Spontaneous Step-level Self-correction Makes Large Language Models Better Mathematical Reasoners

On the Intrinsic Self-Correction Capability of LLMs: Uncertainty and Latent Concept

Language Model Self-improvement by Reinforcement Learning Contemplation

N-Critics: Self-Refinement of Large Language Models with Ensemble of Critics

SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights