Training Language Models to Self-Correct via Reinforcement Learning

Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, Aleksandra Faust
2024-10-05
Abstract: Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are often insufficient for instilling self-correction behavior. In particular, we observe that training via SFT falls prey to either a distribution mismatch between mistakes made by the data-collection policy and the model's own responses, or to behavior collapse, where learning implicitly prefers only a certain mode of correction behavior that is often not effective at self-correction on test problems. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction behavior that is effective at test time as opposed to fitting high-reward responses for a given prompt. This regularization process includes an initial phase of multi-turn RL on a base model to generate a policy initialization that is less susceptible to collapse, followed by using a reward bonus to amplify self-correction. With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.
Machine Learning
What problem does this paper attempt to address?
The paper addresses the lack of effective self-correction in large language models (LLMs). Specifically, modern LLMs struggle to detect and correct their own erroneous responses without external input. Although these models typically possess the knowledge required to solve a problem, they often fail to use that knowledge to arrive at a correct answer.

### Main Objectives of the Paper:

1. **Enhance the self-correction ability of LLMs**: Develop a multi-turn online reinforcement learning method (SCoRe) that enables LLMs to detect and correct their own errors without external supervision.
2. **Address the limitations of existing methods**: Current self-correction methods often rely on multiple models, a more advanced model, or additional forms of supervision. SCoRe aims to remove these dependencies by training the model using only self-generated data.

### Background and Motivation:

- **Limitations of existing methods**: Approaches such as supervised fine-tuning (SFT) and simple reinforcement learning (RL) suffer from distribution mismatch and behavior collapse when used to train self-correction. Distribution mismatch is the gap between the errors in the training data and the errors the model itself makes, which leads to poor training outcomes. Behavior collapse is the model's tendency to produce its best response on the first attempt and then make no substantial changes in subsequent attempts.
- **Design of SCoRe**: SCoRe trains directly on the model's own distribution through multi-turn online reinforcement learning and uses appropriate regularization to guide the learning process, thus avoiding behavior collapse.

Specifically, SCoRe includes two stages:

- **First Stage**: Train a policy initialization that produces high-reward responses on the second attempt while keeping its first-attempt responses close to those of the base model.
- **Second Stage**: Jointly optimize both attempts, using shaped rewards to incentivize the discovery of self-correction strategies rather than simply generating the best first response and making minor modifications in the second attempt.

### Experimental Results:

- **Performance Improvement**: Using Gemini 1.0 Pro and 1.5 Flash models, SCoRe improved the base models' self-correction by 15.6% on MATH and 9.1% on HumanEval.
- **Comparative Experiments**: Compared to existing SFT and standard RL methods, SCoRe performed better at self-correction, especially on unseen test problems.

### Conclusion:

SCoRe is an effective multi-turn reinforcement learning method that significantly enhances the self-correction ability of LLMs, enabling them to better detect and correct their own errors without external supervision. The method is expected to be useful in areas such as scientific computation, mathematical reasoning, and code writing.
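
### Illustrative Sketch of the Shaped Reward:

The summary above does not give the exact form of the second-stage shaped reward. As a minimal sketch, assuming binary correctness rewards and a bonus proportional to the improvement from the first to the second attempt, the shaping might look like the Python below. The class name, the function name, the `alpha` weight, and its default value are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoAttemptRewards:
    """Correctness rewards for one two-attempt trajectory (assumed 0.0 or 1.0)."""
    first: float   # reward of the first attempt
    second: float  # reward of the second (corrected) attempt


def shaped_second_attempt_reward(r: TwoAttemptRewards, alpha: float = 2.0) -> float:
    """Second-attempt reward plus a hypothetical improvement bonus.

    Assumption: bonus = alpha * (second - first). It is positive when the
    second attempt fixes a wrong first attempt, zero when nothing changes,
    and negative when a correct first attempt is made wrong.
    """
    bonus = alpha * (r.second - r.first)
    return r.second + bonus


if __name__ == "__main__":
    cases = {
        "wrong -> right": TwoAttemptRewards(0.0, 1.0),
        "right -> right": TwoAttemptRewards(1.0, 1.0),
        "right -> wrong": TwoAttemptRewards(1.0, 0.0),
        "wrong -> wrong": TwoAttemptRewards(0.0, 0.0),
    }
    for name, rewards in cases.items():
        print(f"{name}: {shaped_second_attempt_reward(rewards)}")
```

Under these assumptions, a trajectory that fixes a wrong first answer earns more on the second attempt than one that repeats an already correct answer, and turning a correct answer into a wrong one is penalized. That is the kind of pressure that would discourage the "no substantial modification" collapse described above.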