Abstract: Reinforcement Learning from Human Feedback (RLHF) is currently the most widely used method to align large language models (LLMs) with human preferences. Existing RLHF methods can be roughly categorized as either reward-based or reward-free. Novel applications such as ChatGPT and Claude leverage reward-based methods that first learn a reward model and apply actor-critic algorithms, such as Proximal Policy Optimization (PPO). However, in academic benchmarks, state-of-the-art results are often achieved via reward-free methods, such as Direct Preference Optimization (DPO). Is DPO truly superior to PPO? Why does PPO perform poorly on these benchmarks? In this paper, we first conduct both theoretical and empirical studies on the algorithmic properties of DPO and show that DPO may have fundamental limitations. Moreover, we also comprehensively examine PPO and reveal the key factors for the best performances of PPO in fine-tuning LLMs. Finally, we benchmark DPO and PPO across a collection of RLHF testbeds, ranging from dialogue to code generation. Experiment results demonstrate that PPO is able to surpass other alignment methods in all cases and achieve state-of-the-art results in challenging code competitions. Our code is publicly available at https://github.com/openpsi-project/ReaLHF.

What problem does this paper attempt to address?

The paper asks whether Direct Preference Optimization (DPO) is actually superior to Proximal Policy Optimization (PPO) for aligning large language models (LLMs), and why PPO performs poorly on some academic benchmarks. It focuses on two questions:

1. **Is DPO really better than PPO?**
2. **Why does PPO perform worse on these benchmarks?**

To answer them, the authors carry out theoretical and empirical studies of the algorithmic properties of DPO and PPO, and identify the key factors that let PPO reach its best performance when fine-tuning LLMs. They then compare the two methods experimentally on a range of tasks, from dialogue generation to code generation.

### Main Findings

- **Theoretical Analysis**: The authors argue that DPO may find biased solutions, especially in the presence of out-of-distribution (OOD) data. DPO may exploit OOD responses, causing the learned policy to deviate from the reference policy.
- **Empirical Analysis**: In experiments on real preference datasets, the authors find that DPO's performance depends on the base model and on the distribution of the preference data. In particular, when the distribution of the preference dataset does not match the model's output distribution, DPO's performance declines significantly.
- **Advantages of PPO**: Through ablation experiments, the authors identify three key factors behind PPO's strong performance in RLHF: advantage normalization, large-batch-size training, and an exponential moving average update for the reference model. With these techniques, PPO outperforms the other alignment methods in all experiments and reaches state-of-the-art results on the challenging code competition task.
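Two of the PPO techniques named in the findings above, advantage normalization and the exponential moving average (EMA) update of the reference model, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `decay` and `eps` values, and the representation of model parameters as flat lists of floats are all assumptions made for illustration.

```python
from statistics import fmean, pstdev


def normalize_advantages(advantages, eps=1e-8):
    """Rescale a batch of advantage estimates to zero mean and unit variance.

    `eps` (an assumed value) guards against division by zero when all
    advantages in the batch are identical.
    """
    mean = fmean(advantages)
    std = pstdev(advantages)  # population standard deviation of the batch
    return [(a - mean) / (std + eps) for a in advantages]


def ema_update(ref_params, policy_params, decay=0.99):
    """Move the reference parameters a small step toward the policy parameters.

    Parameters are modeled as flat lists of floats (a simplification);
    `decay` is a hypothetical value, not one taken from the paper.
    """
    return [decay * r + (1.0 - decay) * p
            for r, p in zip(ref_params, policy_params)]


if __name__ == "__main__":
    print(normalize_advantages([1.0, 2.0, 3.0]))
    print(ema_update([0.0, 1.0], [1.0, 1.0], decay=0.9))
```

In this sketch the reference model is refreshed after each policy update, so it tracks the policy slowly instead of staying fixed at the initial model; how often and with what decay the paper applies the update is not stated in this summary.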
### Conclusion

Through theoretical and empirical study, the paper reports that PPO outperforms the other alignment methods across the tasks tested, particularly on complex tasks such as code generation. The authors also identify limitations of DPO and suggest ways to mitigate them, such as using iterative DPO to reduce the distribution mismatch problem. In summary, the paper answers the question of whether DPO is better than PPO and examines the strengths and weaknesses of both methods, which should be useful for future research.
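For readers unfamiliar with DPO, the standard per-pair DPO loss can be sketched as below. It is offered as background, not as code from the paper: the function name, argument names, and the default `beta` are assumptions. The sketch makes one property visible: the loss depends only on the log-probabilities of the two responses in the preference pair, so it places no direct constraint on responses outside the preference data.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Each argument is the summed log-probability of a full response under
    the policy or the frozen reference model. `beta` is an assumed default.
    """
    # Implicit reward margin between the chosen and rejected responses.
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: the margin is zero.
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability toward the chosen response.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

When the policy equals the reference model the margin is zero, and the loss falls as the policy raises the chosen response's log-probability relative to the rejected one.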

Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models

Pairwise Proximal Policy Optimization: Harnessing Relative Feedback for LLM Alignment

Aligning CodeLLMs with Direct Preference Optimization

DPO Meets PPO: Reinforced Token Optimization for RLHF

Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model

Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts

Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks

$α$-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs

Fine-Tuning Language Models with Advantage-Induced Policy Alignment

Bootstrapping Language Models with DPO Implicit Rewards

MPPO: Multi Pair-wise Preference Optimization for LLMs with Arbitrary Negative Samples

Direct Alignment of Language Models via Quality-Aware Self-Refinement

Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization

LiPO: Listwise Preference Optimization through Learning-to-Rank

Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint

Aligning Large Language Models via Fine-grained Supervision

Accelerated Preference Optimization for Large Language Model Alignment

Secrets of RLHF in Large Language Models Part I: PPO