Abstract: Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are often insufficient for instilling self-correction behavior. In particular, we observe that training via SFT falls prey to either a distribution mismatch between mistakes made by the data-collection policy and the model's own responses, or to behavior collapse, where learning implicitly prefers only a certain mode of correction behavior that is often not effective at self-correction on test problems. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction behavior that is effective at test time as opposed to fitting high-reward responses for a given prompt. This regularization process includes an initial phase of multi-turn RL on a base model to generate a policy initialization that is less susceptible to collapse, followed by using a reward bonus to amplify self-correction. With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1%, respectively, on MATH and HumanEval.
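
To make the idea of a reward bonus that amplifies self-correction concrete, here is a minimal sketch in Python. The abstract does not give the bonus formula, so everything here is an assumption for illustration: the function name `shaped_second_attempt_reward`, the weight `alpha`, and the choice to add a term proportional to the change in correctness between the first and second attempts are all hypothetical.

```python
def shaped_second_attempt_reward(r1: float, r2: float, alpha: float = 1.0) -> float:
    """Return the second-attempt reward plus an improvement bonus.

    r1, r2 : correctness rewards of the first and second attempts
             (assumed here to be 0.0 for wrong, 1.0 for right).
    alpha  : hypothetical bonus weight; not a value taken from the abstract.
    """
    # The bonus is positive when the second attempt fixes a wrong first
    # attempt, negative when it breaks a correct one, and zero otherwise.
    bonus = alpha * (r2 - r1)
    return r2 + bonus


if __name__ == "__main__":
    cases = [
        ("wrong -> right", 0.0, 1.0),
        ("right -> right", 1.0, 1.0),
        ("right -> wrong", 1.0, 0.0),
        ("wrong -> wrong", 0.0, 0.0),
    ]
    for label, r1, r2 in cases:
        reward = shaped_second_attempt_reward(r1, r2, alpha=2.0)
        print(f"{label}: shaped reward = {reward}")
```

Under this assumed shaping, a policy that merely repeats a correct first answer earns less than one that turns an incorrect answer into a correct one, which is one plausible way a bonus could discourage collapse onto a "do not edit" mode of behavior.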

Internalized Self-Correction for Large Language Models

Large Language Models have Intrinsic Self-Correction Ability

Confidence Matters: Revisiting Intrinsic Self-Correction Capabilities of Large Language Models

Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies

On the Intrinsic Self-Correction Capability of LLMs: Uncertainty and Latent Concept

Small Language Model Can Self-correct

N-Critics: Self-Refinement of Large Language Models with Ensemble of Critics

Large Language Models Cannot Self-Correct Reasoning Yet

Self-Correction is More than Refinement: A Learning Framework for Visual and Language Reasoning Tasks

Training Language Models to Self-Correct via Reinforcement Learning

Smaller Large Language Models Can Do Moral Self-Correction

Is Moral Self-correction An Innate Capability of Large Language Models? A Mechanistic Analysis to Self-correction

Automatically Correcting Large Language Models: Surveying the Landscape of Diverse Automated Correction Strategies

When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs

Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis

Small Language Models Need Strong Verifiers to Self-Correct Reasoning

Internal Consistency and Self-Feedback in Large Language Models: A Survey

CorrectionLM: Self-Corrections with SLM for Dialogue State Tracking

Language Model Self-improvement by Reinforcement Learning Contemplation

Confidence v.s. Critique: A Decomposition of Self-Correction Capability for LLMs

A Theoretical Understanding of Self-Correction through In-context Alignment