Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, Yi Wu
2024-10-10
Abstract: Reinforcement Learning from Human Feedback (RLHF) is currently the most widely used method to align large language models (LLMs) with human preferences. Existing RLHF methods can be roughly categorized as either reward-based or reward-free. Novel applications such as ChatGPT and Claude leverage reward-based methods that first learn a reward model and apply actor-critic algorithms, such as Proximal Policy Optimization (PPO). However, in academic benchmarks, state-of-the-art results are often achieved via reward-free methods, such as Direct Preference Optimization (DPO). Is DPO truly superior to PPO? Why does PPO perform poorly on these benchmarks? In this paper, we first conduct both theoretical and empirical studies on the algorithmic properties of DPO and show that DPO may have fundamental limitations. Moreover, we also comprehensively examine PPO and reveal the key factors for the best performances of PPO in fine-tuning LLMs. Finally, we benchmark DPO and PPO across a collection of RLHF testbeds, ranging from dialogue to code generation. Experiment results demonstrate that PPO is able to surpass other alignment methods in all cases and achieve state-of-the-art results in challenging code competitions. Our code is publicly available at https://github.com/openpsi-project/ReaLHF.
Computation and Language
What problem does this paper attempt to address?
The paper asks whether Direct Preference Optimization (DPO) is actually superior to Proximal Policy Optimization (PPO) for aligning large language models (LLMs), and why PPO performs poorly on some academic benchmarks. It focuses on two questions:

1. **Is DPO really better than PPO?**
2. **Why does PPO perform worse on these benchmarks?**

To answer them, the authors carry out theoretical and empirical studies of the algorithmic properties of DPO and PPO, and identify the key factors that let PPO reach its best performance when fine-tuning LLMs. They then compare the two methods experimentally across a range of tasks, from dialogue generation to code generation.

### Main Findings

- **Theoretical Analysis**: The authors argue that DPO may find biased solutions, especially in the presence of out-of-distribution (OOD) data. DPO may exploit OOD responses, which lets the learned policy deviate from the reference policy.
- **Empirical Analysis**: In experiments on real preference datasets, the authors find that DPO's performance depends on the base model and on the distribution of the preference data. In particular, when the distribution of the preference dataset does not match the model's output distribution, DPO's performance declines significantly.
- **Advantages of PPO**: Through ablation experiments, the authors identify three key factors behind PPO's strong performance in RLHF: advantage normalization, large-batch-size training, and exponential moving average updates for the reference model. With these techniques, PPO outperforms the other alignment methods in all of the paper's experiments and reaches state-of-the-art results on the challenging code competition task.
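Two of the three PPO factors listed above, advantage normalization and the exponential moving average (EMA) update of the reference model, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the representation of model parameters as a plain dictionary of floats, and the default `decay=0.99` are assumptions made only for illustration.

```python
import math


def normalize_advantages(advantages, eps=1e-8):
    """Rescale a batch of advantage estimates to zero mean and unit variance.

    `eps` (an assumed value) guards against division by zero when all
    advantages in the batch are identical.
    """
    n = len(advantages)
    mean = sum(advantages) / n
    var = sum((a - mean) ** 2 for a in advantages) / n
    std = math.sqrt(var)
    return [(a - mean) / (std + eps) for a in advantages]


def ema_update(ref_params, policy_params, decay=0.99):
    """Move each reference parameter a small step toward the policy parameter.

    Parameters are modelled as {name: float} for simplicity; `decay` is a
    hypothetical default, not a value taken from the paper.
    """
    return {
        name: decay * ref_params[name] + (1.0 - decay) * policy_params[name]
        for name in ref_params
    }
```

With a decay close to 1, the reference model changes slowly, so the KL penalty in PPO is measured against a reference that tracks the policy gradually rather than staying fixed at the initial model.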
### Conclusion

Based on its theoretical and empirical studies, the paper reports that PPO consistently outperforms the other alignment methods across the tasks it evaluates, especially on complex tasks such as code generation. The authors also point out the limitations of DPO and suggest ways to improve it, such as using iterative DPO to alleviate the distribution mismatch problem. In summary, the paper answers the question of whether DPO is better than PPO, and examines the strengths and weaknesses of both methods.
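For context on the DPO limitations mentioned above, the following is a minimal sketch of the standard per-pair DPO loss as it is commonly stated in the DPO literature; the formula is not quoted from this summary. The function name, the argument names, and the default `beta=0.1` are assumptions for illustration, and each argument is the summed log-probability of a whole response under the policy or the reference model.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_ratio - rejected_ratio))).

    Each ratio is the log-probability under the policy minus the
    log-probability under the reference model. `beta` is an assumed default.
    """
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss only involves the chosen and rejected responses in the preference dataset. Nothing in it directly constrains the probability the policy assigns to responses that never appear in that data, which is one way to read the concern about OOD responses and distribution mismatch.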