Abstract: Learning from preference feedback has emerged as an essential step for improving the generation quality and performance of modern language models (LMs). Despite its widespread use, the way preference-based learning is applied varies wildly, with differing data, learning algorithms, and evaluations used, making disentangling the impact of each aspect difficult. In this work, we identify four core aspects of preference-based learning: preference data, learning algorithm, reward model, and policy training prompts, systematically investigate the impact of these components on downstream model performance, and suggest a recipe for strong learning for preference feedback. Our findings indicate that all aspects are important for performance, with better preference data leading to the largest improvements, followed by the choice of learning algorithm, the use of improved reward models, and finally the use of additional unlabeled prompts for policy training. Notably, PPO outperforms DPO by up to 2.5% in math and 1.2% in general domains. High-quality preference data leads to improvements of up to 8% in instruction following and truthfulness. Despite significant gains of up to 5% in mathematical evaluation when scaling up reward models, we surprisingly observe marginal improvements in other categories. We publicly release the code used for training (https://github.com/hamishivi/EasyLM) and evaluating (https://github.com/allenai/open-instruct) our models, along with the models and datasets themselves (https://huggingface.co/collections/allenai/tulu-v25-suite-66676520fd578080e126f618).

What problem does this paper attempt to address?

This paper addresses how to systematically study and understand the impact of each key component of preference-based learning on downstream model performance. Specifically, it aims to answer the following questions:

1. **Which aspects matter most for preference-based learning?**
   - The authors identify four core aspects (preference data, learning algorithm, reward model, and policy training prompts) and systematically study the impact of each on downstream model performance.
2. **How do the two mainstream preference-based learning algorithms, PPO and DPO, perform under different conditions?**
   - PPO (Proximal Policy Optimization) and DPO (Direct Preference Optimization) are two different preference-based learning methods. The paper explores their respective advantages and disadvantages by comparing them on the same initial model and training data.
3. **How do different types of preference data (human-annotated, web-crawled, synthetically generated, etc.) affect model performance?**
   - The study uses preference datasets from multiple sources and compares models trained on them across different evaluation tasks to determine which type of data improves performance the most.
4. **How do the scale of the reward model and the amount of its training data affect final model performance?**
   - The authors scale up the reward model and increase the amount of training data, then measure the effect on direct evaluation and on downstream task performance.
5. **How do different types of policy training prompts affect performance in specific domains and overall?**
   - The paper also compares prompts targeted at specific tasks (such as mathematical ability) with general prompts, and measures the effect of each on model performance.
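
To make the PPO/DPO comparison in question 2 concrete, here is a minimal sketch of the per-example DPO objective as defined in the original DPO paper: the loss is -log σ(β · [(log π(y_w|x) - log π_ref(y_w|x)) - (log π(y_l|x) - log π_ref(y_l|x))]), where y_w is the preferred response and y_l the rejected one. This is not the code released with the paper summarized here. The function names, the scalar log-probability inputs, and the default `beta=0.1` are illustrative assumptions. PPO, by contrast, first trains a separate reward model on the preference data and then optimizes the policy on responses it samples online, which is too involved to sketch in a few lines.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value, not a setting taken from the paper
) -> float:
    """Per-example DPO loss from summed sequence log-probabilities.

    Each argument is log p(response | prompt) under either the policy being
    trained or the frozen reference model (hypothetical inputs for this sketch).
    """
    # How much more likely each response has become relative to the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Implicit reward margin between the chosen and the rejected response.
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


if __name__ == "__main__":
    # Policy identical to the reference: the margin is 0 and the loss is log 2.
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy shifted toward the chosen response: the loss falls below log 2.
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

The loss depends only on log-probabilities of responses already present in the preference dataset, so DPO needs no reward model and no sampling during training. That structural difference from PPO is what the paper's controlled comparison isolates.
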

### Main Findings

- **High-quality preference data is the most important factor**: In particular, synthetic data with fine-grained annotations can significantly improve model performance.
- **PPO outperforms DPO**: Across multiple evaluation metrics, PPO generally performs better than DPO, with the most notable improvements in reasoning, coding, and safety.
- **Reward model improvements have a large impact on only some tasks**: Improving the reward model has limited effect in most evaluations, but it yields a significant gain on mathematical reasoning tasks (such as GSM8k).
- **Task-specific prompts help in the matching domain**: For example, using prompts from the GSM8k training set can significantly improve mathematical reasoning ability, while general prompts bring limited improvement in overall performance.

In summary, through systematic experiments and analysis, the paper provides a comprehensive understanding of preference-based learning and suggests a recipe for applying it effectively.

Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback
