Teaching Large Language Models to Reason with Reinforcement Learning

Alex Havrilla, Yuqing Du, Sharath Chandra Raparthy, Christoforos Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, Sainbayar Sukhbaatar, Roberta Raileanu
2024-03-08
Abstract: Reinforcement Learning from Human Feedback (RLHF) has emerged as a dominant approach for aligning LLM outputs with human preferences. Inspired by the success of RLHF, we study the performance of multiple algorithms that learn from feedback (Expert Iteration, Proximal Policy Optimization (PPO), Return-Conditioned RL) on improving LLM reasoning capabilities. We investigate both sparse and dense rewards provided to the LLM both heuristically and via a learned reward model. We additionally start from multiple model sizes and initializations both with and without supervised fine-tuning (SFT) data. Overall, we find all algorithms perform comparably, with Expert Iteration performing best in most cases. Surprisingly, we find the sample complexity of Expert Iteration is similar to that of PPO, requiring at most on the order of $10^6$ samples to converge from a pretrained checkpoint. We investigate why this is the case, concluding that during RL training models fail to explore significantly beyond solutions already produced by SFT models. Additionally, we discuss a trade-off between maj@1 and pass@96 metric performance during SFT training and how conversely RL training improves both simultaneously. We then conclude by discussing the implications of our findings for RLHF and the future role of RL in LLM fine-tuning.
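The maj@1 and pass@96 metrics named in the abstract can be illustrated with a minimal sketch. It assumes the common definitions: pass@k counts a problem as solved if any of k sampled answers is correct, and maj@k counts it as solved if the most frequent of k sampled answers is correct. The function names and the toy data are hypothetical, and the paper's exact sampling setup is not reproduced here.

```python
from collections import Counter


def pass_at_k(samples, reference):
    """True if at least one sampled answer matches the reference."""
    return any(s == reference for s in samples)


def maj_at_k(samples, reference):
    """True if the most common sampled answer matches the reference."""
    if not samples:
        return False
    top_answer, _ = Counter(samples).most_common(1)[0]
    return top_answer == reference


def score(dataset, metric):
    """Average a per-problem metric over (samples, reference) pairs."""
    return sum(metric(s, r) for s, r in dataset) / len(dataset)


# Toy data (made up): three problems, k = 3 sampled answers each.
toy = [
    (["4", "4", "5"], "4"),  # majority vote is correct
    (["7", "9", "9"], "7"),  # majority vote is wrong, one sample is correct
    (["1", "2", "3"], "0"),  # no sample is correct
]

pass_score = score(toy, pass_at_k)  # 2 of 3 problems
maj_score = score(toy, maj_at_k)    # 1 of 3 problems
```

The second toy problem shows how the two metrics can diverge: a model that spreads its samples over many answers can score well on pass@k while its single most likely answer is wrong.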
Computer Science
What problem does this paper attempt to address?
The paper asks how reinforcement learning (RL) can improve the reasoning ability of large language models (LLMs). Specifically, the authors compared several algorithms that learn from feedback (Expert Iteration, Proximal Policy Optimization (PPO), and Return-Conditioned RL) on reasoning tasks, and studied how the reward mechanism (sparse or dense rewards) and the model initialization affect each algorithm.

### Main problem decomposition:

1. **Improving LLM reasoning ability**:
   - Inspired by the success of RLHF (Reinforcement Learning from Human Feedback), the authors focused on whether algorithms that learn from feedback can enhance LLM reasoning on problems such as mathematics, science, and programming.
2. **Evaluating the performance of different RL algorithms**:
   - Compared Expert Iteration, PPO, and Return-Conditioned RL on reasoning tasks and analyzed their sample complexity.
3. **Exploring different reward mechanisms**:
   - Studied sparse and dense rewards, provided either heuristically or by a learned reward model, and their effect on model performance.
4. **The influence of different model initializations**:
   - Analyzed how each RL algorithm performs across model sizes and when initialized from a pretrained checkpoint versus a supervised fine-tuning (SFT) checkpoint.

### Core contributions of the paper:

- **Comprehensive research**: Conducted a comprehensive study of PPO fine-tuning under different reward types, model sizes, and initialization methods.
- **Algorithm comparison**: Compared PPO with Expert Iteration and Return-Conditioned RL, and found that Expert Iteration performs best in most cases, with sample complexity similar to that of PPO.
- **Discussion of future directions**: Discussed the implications of the results for RLHF and for future RL-based LLM fine-tuning, identifying limited exploration as the main limiting factor.

Through these studies, the authors aim to inform future LLM fine-tuning and to encourage more effective reinforcement learning methods.
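As a rough illustration of the expert iteration loop (sample candidates, keep those that earn the reward, refit on the kept samples), here is a toy sketch. It is not the paper's implementation. The tabular policy stands in for an LLM, the exact-match check is an assumed sparse reward, and `expert_iteration`, `step`, and the toy problems are hypothetical names chosen for the example.

```python
import random


def expert_iteration(problems, policy, rounds=3, k=8, step=1.0, seed=0):
    """Toy expert iteration over a tabular policy.

    problems: dict mapping question -> correct answer
    policy:   dict mapping question -> {candidate answer: probability}
    Returns an updated copy; the input policy is left unchanged.
    """
    rng = random.Random(seed)
    policy = {q: dict(dist) for q, dist in policy.items()}
    for _ in range(rounds):
        kept = {}
        for q, answer in problems.items():
            candidates = list(policy[q])
            weights = [policy[q][c] for c in candidates]
            # 1. Sample k candidate answers from the current policy.
            samples = rng.choices(candidates, weights=weights, k=k)
            # 2. Keep only samples that earn the sparse reward (exact match).
            kept[q] = [s for s in samples if s == answer]
        for q, good in kept.items():
            # 3. Stand-in for fine-tuning: shift probability mass toward
            #    the kept samples, then renormalize.
            dist = policy[q]
            for s in good:
                dist[s] += step / k
            total = sum(dist.values())
            policy[q] = {c: p / total for c, p in dist.items()}
    return policy


# Toy problems (made up). For "3*3" the correct answer has zero
# probability, so it is never sampled and never reinforced.
problems = {"2+2": "4", "3*3": "9"}
policy = {
    "2+2": {"4": 0.5, "5": 0.5},
    "3*3": {"6": 1.0, "9": 0.0},
}
trained = expert_iteration(problems, policy)
```

In this toy setting the loop can only reinforce answers the starting policy already produces, which loosely mirrors the exploration limit described above: the probability of "9" for the second problem stays at zero no matter how many rounds are run.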