Abstract: Recently, numerous preference optimization algorithms have been introduced as extensions to the Direct Preference Optimization (DPO) family. While these methods have successfully aligned models with human preferences, there is a lack of understanding regarding the contributions of their additional components. Moreover, fair and consistent comparisons are scarce, making it difficult to discern which components genuinely enhance downstream performance. In this work, we propose RainbowPO, a unified framework that demystifies the effectiveness of existing DPO methods by categorizing their key components into seven broad directions. We integrate these components into a single cohesive objective, enhancing the performance of each individual element. Through extensive experiments, we demonstrate that RainbowPO outperforms existing DPO variants. Additionally, we provide insights to guide researchers in developing new DPO methods and assist practitioners in their implementations.

What problem does this paper attempt to address?

This paper addresses the lack of systematic research on how effective the individual components introduced by Direct Preference Optimization (DPO) variants are, and on whether these components can be combined to improve performance further. Specifically:

1. **Lack of understanding of the effectiveness of existing DPO method components**: Although many DPO extensions have succeeded on specific tasks, how much each added component contributes to the improvement remains unclear. This makes it difficult to select among and compare DPO variants.
2. **Lack of fair and consistent comparison**: Existing DPO variants differ in base model size, architecture, alignment dataset, experimental settings, and evaluation metrics, so fair and consistent comparisons among them are very difficult.
3. **Exploring the synergy of components**: The paper also explores whether these components are complementary and can be combined effectively to achieve better overall performance.

To address these problems, the authors propose RainbowPO, a unified framework that integrates the key components of existing DPO methods and verifies their effectiveness through extensive experiments. RainbowPO enhances the effect of each individual component and further improves performance by combining multiple components appropriately.

### Main contributions

1. **Comprehensive analysis of existing DPO variants**: The paper analyzes in detail more than 10 representative offline DPO variants, identifies seven main directions of improvement, and demonstrates the effectiveness of four of them through theory and experiments.
2. **Identify and summarize effective components**: The paper summarizes seven components that appear widely in DPO extensions: length normalization, link function, advantage term, reference policy, contextual scaling, Rejection Sampling Optimization (RSO), and Supervised Fine-Tuning loss (SFT loss). It verifies the effectiveness of four of these components through experiments.
3. **Propose RainbowPO**: The paper proposes RainbowPO, a DPO variant that combines three key and orthogonal components. With a tuned number of training epochs and tuned hyperparameters, RainbowPO performs strongly when fine-tuning the Llama3-8B-Instruct model: on the AlpacaEval benchmark, the length-controlled win rate increases from 22.92% to 51.66%.

### Experimental results

- **Effect of individual components**: Length normalization, the mixed reference policy, and contextual scaling each improve performance significantly when used alone, while other components such as the advantage term and the link function bring no obvious improvement.
- **Effect of component combinations**: Combining length normalization with the mixed reference policy yields a particularly large improvement. Rejection Sampling Optimization (RSO) also improves performance to a certain extent.

In conclusion, through systematic analysis and experiments, this paper clarifies the effectiveness of each component in existing DPO methods and proposes a new framework, RainbowPO, which provides valuable guidance for future research and practice.
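
To make two of the components named above concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, with optional length normalization and an optional margin term. It is not the RainbowPO objective from the paper. The function names, the parameters `length_normalize` and `margin`, and the choice to divide the policy-to-reference log ratio by the response length are all assumptions made for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    len_chosen: int = 1,
    len_rejected: int = 1,
    beta: float = 0.1,
    length_normalize: bool = False,  # hypothetical switch, for illustration only
    margin: float = 0.0,             # hypothetical margin term, for illustration only
) -> float:
    """DPO-style loss for one (chosen, rejected) pair of summed log-probabilities."""
    # Log ratio of the policy to the reference model for each response.
    ratio_chosen = policy_logp_chosen - ref_logp_chosen
    ratio_rejected = policy_logp_rejected - ref_logp_rejected

    if length_normalize:
        # Assumption: normalize each log ratio by its response length in tokens.
        ratio_chosen /= len_chosen
        ratio_rejected /= len_rejected

    # Implicit reward difference, shifted by the margin.
    logits = beta * (ratio_chosen - ratio_rejected) - margin

    # Negative log-likelihood under the logistic (sigmoid) link function.
    return -log_sigmoid(logits)


if __name__ == "__main__":
    plain = preference_loss(-10.0, -20.0, -12.0, -18.0)
    normalized = preference_loss(
        -10.0, -20.0, -12.0, -18.0,
        len_chosen=4, len_rejected=8, length_normalize=True,
    )
    print(f"plain DPO loss: {plain:.4f}")
    print(f"length-normalized loss: {normalized:.4f}")
```

With `length_normalize=False` and `margin=0.0` this reduces to the usual DPO loss, so each modification can be switched on separately and compared against the baseline, in the spirit of the component-wise ablations described above.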

RainbowPO: A Unified Framework for Combining Improvements in Preference Optimization

New Desiderata for Direct Preference Optimization

A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

MallowsPO: Fine-Tune Your LLM with Preference Dispersions

Accelerating Direct Preference Optimization with Prefix Sharing

Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts

Preference as Reward, Maximum Preference Optimization with Importance Sampling

WPO: Enhancing RLHF with Weighted Preference Optimization

Policy Optimization in RLHF: The Impact of Out-of-preference Data

SimPO: Simple Preference Optimization with a Reference-Free Reward

Orthogonal Finetuning for Direct Preference Optimization

Minor DPO reject penalty to increase training robustness

Preference Optimization with Multi-Sample Comparisons

Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback

Decomposed Direct Preference Optimization for Structure-Based Drug Design

Towards Improved Preference Optimization Pipeline: from Data Generation to Budget-Controlled Regularization

$f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization

Direct Preference Optimization with an Offset

Hybrid Preference Optimization: Augmenting Direct Preference Optimization with Auxiliary Objectives

$α$-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs