Abstract: Reinforcement Learning from Human Feedback (RLHF) has emerged as a dominant approach for aligning LLM outputs with human preferences. Inspired by the success of RLHF, we study the performance of multiple algorithms that learn from feedback (Expert Iteration, Proximal Policy Optimization (PPO), Return-Conditioned RL) on improving LLM reasoning capabilities. We investigate both sparse and dense rewards provided to the LLM both heuristically and via a learned reward model. We additionally start from multiple model sizes and initializations both with and without supervised fine-tuning (SFT) data. Overall, we find all algorithms perform comparably, with Expert Iteration performing best in most cases. Surprisingly, we find the sample complexity of Expert Iteration is similar to that of PPO, requiring at most on the order of $10^6$ samples to converge from a pretrained checkpoint. We investigate why this is the case, concluding that during RL training models fail to explore significantly beyond solutions already produced by SFT models. Additionally, we discuss a trade-off between maj@1 and pass@96 metric performance during SFT training and how conversely RL training improves both simultaneously. We then conclude by discussing the implications of our findings for RLHF and the future role of RL in LLM fine-tuning.
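
The abstract contrasts maj@1 and pass@96 without defining them. As a rough illustration, the sketch below assumes the common reading of these metric families: pass@k counts a problem as solved if any of k sampled answers is correct, and maj@k counts it as solved if the most frequent of k sampled answers is correct. These definitions, the function names, and the toy data are assumptions for illustration and are not taken from the paper.

```python
from collections import Counter

def pass_at_k(samples, reference):
    """Assumed definition: solved if ANY of the k sampled answers is correct."""
    return any(answer == reference for answer in samples)

def maj_at_k(samples, reference):
    """Assumed definition: solved if the most frequent sampled answer is correct."""
    most_common_answer, _ = Counter(samples).most_common(1)[0]
    return most_common_answer == reference

# Hypothetical toy data: (k sampled final answers, reference answer) per problem.
problems = [
    (["4", "5", "4"], "4"),  # majority is correct, and a correct sample exists
    (["7", "9", "9"], "7"),  # a correct sample exists, but the majority is wrong
    (["1", "2", "3"], "0"),  # no sample is correct
]

pass_rate = sum(pass_at_k(s, r) for s, r in problems) / len(problems)
maj_rate = sum(maj_at_k(s, r) for s, r in problems) / len(problems)
print(f"pass@3 = {pass_rate:.2f}, maj@3 = {maj_rate:.2f}")
```

Under these assumed definitions, the second problem shows how the two metrics can diverge: diverse sampling helps pass@k even when the most likely answer is wrong, which is one way to picture the trade-off the abstract describes.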

What problem does this paper attempt to address?

The paper asks how reinforcement learning (RL) techniques can improve the reasoning ability of large language models (LLMs). Specifically, the authors compare several algorithms that learn from feedback (Expert Iteration, Proximal Policy Optimization (PPO), and Return-Conditioned RL) on LLM reasoning tasks, and study how different reward mechanisms (sparse and dense rewards) and model initialization methods affect the performance of these algorithms.

### Main problem decomposition:

1. **Improving LLM reasoning ability**:
   - The authors focus on whether RL methods, inspired by the success of RLHF (Reinforcement Learning from Human Feedback), can enhance LLM reasoning on problems such as mathematics, science, and programming.
2. **Evaluating the performance of different RL algorithms**:
   - They compare multiple RL algorithms (Expert Iteration, PPO, and Return-Conditioned RL) on reasoning tasks and analyze their sample complexity.
3. **Exploring different reward mechanisms**:
   - They study the effects of sparse and dense rewards, provided both heuristically and via a learned reward model, on model performance.
4. **The influence of different model initializations**:
   - They analyze how each RL algorithm performs across multiple model sizes when initialized from pretrained checkpoints and from Supervised Fine-Tuning (SFT) checkpoints.

### Core contributions of the paper:

- **Comprehensive research**: A comprehensive study of PPO fine-tuning under different types of rewards, model sizes, and initialization methods.
- **Algorithm comparison**: A comparison with Expert Iteration and Return-Conditioned RL, finding that Expert Iteration performs best in most cases, with sample complexity similar to that of PPO.
- **Discussion of future directions**: A discussion of what the results imply for RLHF and future RL-based LLM fine-tuning, pointing to exploration as the main limiting factor: during RL training, models fail to explore significantly beyond solutions already produced by SFT models.

Through these studies, the authors aim to provide insights for future LLM fine-tuning and to encourage more effective reinforcement learning methods.
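
To make the Expert Iteration idea and the sparse reward concrete, here is a minimal toy sketch: sample several answers per prompt, keep only those that earn the sparse reward, and fine-tune on what was kept. Everything in it is an assumption for illustration rather than the authors' implementation: `ToyPolicy` is a hypothetical stand-in for an LLM (a table of weights over candidate answers), `fine_tune` merely upweights kept answers, and whether each round restarts from the original checkpoint or continues from the previous one is a design choice this sketch does not try to match (it continues).

```python
import random

def sparse_reward(answer, reference):
    """Sparse reward: 1.0 only when the final answer is correct, else 0.0."""
    return 1.0 if answer == reference else 0.0

class ToyPolicy:
    """Hypothetical stand-in for an LLM: per-prompt weights over candidate answers."""

    def __init__(self, prompts, candidates):
        self.candidates = candidates
        self.weights = {p: [1.0] * len(candidates) for p in prompts}

    def sample(self, prompt, rng):
        # Draw one answer in proportion to the current weights for this prompt.
        return rng.choices(self.candidates, weights=self.weights[prompt], k=1)[0]

    def fine_tune(self, dataset):
        # Stand-in for supervised fine-tuning: upweight each kept (prompt, answer) pair.
        for prompt, answer in dataset:
            self.weights[prompt][self.candidates.index(answer)] += 1.0

def expert_iteration(policy, tasks, rounds, samples_per_prompt, rng):
    """Each round: sample, filter by reward, then fine-tune on the filtered set."""
    for _ in range(rounds):
        kept = []
        for prompt, reference in tasks.items():
            for _ in range(samples_per_prompt):
                answer = policy.sample(prompt, rng)
                if sparse_reward(answer, reference) == 1.0:
                    kept.append((prompt, answer))
        policy.fine_tune(kept)
    return policy

rng = random.Random(0)
tasks = {"2+2": 4, "3+5": 8}  # hypothetical prompts with reference answers
policy = ToyPolicy(tasks, candidates=list(range(10)))
policy = expert_iteration(policy, tasks, rounds=5, samples_per_prompt=20, rng=rng)
print(policy.weights["2+2"])
```

Because only correct samples are ever kept, the policy can only reinforce answers it already produces. If the correct answer is never sampled, nothing is learned, which mirrors the exploration limit described in the summary.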

Teaching Large Language Models to Reason with Reinforcement Learning
