Abstract: The prevalent deployment of learning from human preferences through reinforcement learning (RLHF) relies on two important approximations: the first assumes that pairwise preferences can be substituted with pointwise rewards. The second assumes that a reward model trained on these pointwise rewards can generalize from collected data to out-of-distribution data sampled by the policy. Recently, Direct Preference Optimisation (DPO) has been proposed as an approach that bypasses the second approximation and learns a policy directly from collected data without the reward modelling stage. However, this method still heavily relies on the first approximation. In this paper we try to gain a deeper theoretical understanding of these practical algorithms. In particular we derive a new general objective called $\Psi$PO for learning from human preferences that is expressed in terms of pairwise preferences and therefore bypasses both approximations. This new general objective allows us to perform an in-depth analysis of the behavior of RLHF and DPO (as special cases of $\Psi$PO) and to identify their potential pitfalls. We then consider another special case for $\Psi$PO by setting $\Psi$ simply to Identity, for which we can derive an efficient optimisation procedure, prove performance guarantees and demonstrate its empirical superiority to DPO on some illustrative examples.

What problem does this paper attempt to address?

This paper primarily focuses on addressing the issue of learning from human preferences and proposes new insights and solutions to the theoretical and practical problems present in the two main existing methods: Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimisation (DPO).

### Main Problems Addressed

1. **Insufficient Theoretical Foundation**: Despite the success of RLHF and DPO in practice, their theoretical foundations are not yet fully developed, particularly the assumptions and approximation methods used when dealing with preference data, which may lead to potential issues.
2. **Overfitting Problem**: The DPO method is prone to overfitting when preference data is close to deterministic (i.e., preferences almost always favor a certain option) because it overly relies on the assumption of converting pairwise preferences into point rewards.
3. **Lack of Effective Regularization**: When preference data is very certain, the KL regularization term in DPO becomes ineffective, leading to a final policy that may deviate significantly from the reference policy.

### New Method Proposed

The paper introduces a universal objective function framework called Ψ-Preference Optimisation (ΨPO), which can maximize preference probabilities in a non-linear manner and balance the distance between the policy and the reference policy by adjusting parameters. This framework not only unifies RLHF and DPO but also overcomes the limitations of these methods.

- **ΨPO Objective Function**: By introducing an arbitrary non-decreasing mapping Ψ, this objective function can be expressed as a function of pairwise preferences, thus avoiding the two key approximations present in RLHF and DPO.
- **Identity-PO (IPO)**: As a special case of ΨPO, when Ψ is set to the identity mapping, IPO not only avoids the overfitting problem but also achieves effective regularization by controlling the gap between the policy and the reference policy.
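
For reference, the ΨPO objective described above can be written roughly as follows. This is a sketch of the form given in the paper as we understand it; the symbols $\rho$ (context distribution), $\mu$ (behaviour policy that samples the comparison response $y'$), $\tau$ (regularization strength) and $\pi_{\text{ref}}$ (reference policy) do not appear in the summary and are introduced here as notation.

$$
\max_{\pi}\; \mathbb{E}_{x \sim \rho,\; y \sim \pi(\cdot \mid x),\; y' \sim \mu(\cdot \mid x)}\Big[\Psi\big(p^*(y \succ y' \mid x)\big)\Big] \;-\; \tau\, D_{\mathrm{KL}}\big(\pi \,\|\, \pi_{\text{ref}}\big)
$$

$$
\Psi(q) = \log\frac{q}{1-q} \quad \text{(RLHF and DPO, under a Bradley-Terry preference model)}, \qquad \Psi(q) = q \quad \text{(IPO)}
$$

Read this way, the overfitting issue is visible in the choice of $\Psi$: with the logit mapping, $\Psi(q) \to \infty$ as $q \to 1$, so for near-deterministic preferences the first term can outweigh the KL penalty for any finite $\tau$. With the identity mapping, $\Psi$ stays bounded in $[0, 1]$ and the KL term remains effective.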
### Theoretical Analysis and Experimental Validation

- **Theoretical Analysis**: The paper reveals the potential issues of RLHF and DPO when dealing with deterministic preference data through theoretical analysis and proves that the IPO method can effectively resolve these issues.
- **Experimental Validation**: Simple experimental examples are used to demonstrate the overfitting problem of DPO when dealing with deterministic preferences, and by comparing the behavior of IPO with DPO under different circumstances, it is validated that IPO can better maintain proximity to the reference policy.

In summary, this paper aims to provide a more robust and flexible solution to the problem of learning from human preferences by proposing the universal ΨPO framework and the specific IPO method, and it validates their effectiveness both theoretically and empirically.
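
The contrast between DPO and IPO under deterministic preferences can be illustrated with a toy sketch. It is not the paper's experimental setup. It assumes a single preference pair in which one response is always preferred, and it treats the log-ratio gap `h = log(pi(y_w)/pi_ref(y_w)) - log(pi(y_l)/pi_ref(y_l))` as a free scalar optimised by plain gradient descent. The function names, learning rate and step counts are hypothetical choices for illustration.

```python
import math


def dpo_loss(h, beta):
    """Per-pair DPO loss, -log(sigmoid(beta * h)), computed stably."""
    z = -beta * h
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def ipo_loss(h, tau):
    """Per-pair IPO loss, (h - 1/(2*tau))**2."""
    return (h - 1.0 / (2.0 * tau)) ** 2


def dpo_grad(h, beta):
    """Derivative of dpo_loss w.r.t. h; negative for every finite h."""
    return -beta / (1.0 + math.exp(beta * h))


def ipo_grad(h, tau):
    """Derivative of ipo_loss w.r.t. h; zero exactly at h = 1/(2*tau)."""
    return 2.0 * (h - 1.0 / (2.0 * tau))


def descend(grad, strength, h0=0.0, lr=0.5, steps=2000):
    """Gradient descent on the scalar gap h (toy assumption: h is a free parameter)."""
    h = h0
    for _ in range(steps):
        h -= lr * grad(h, strength)
    return h


if __name__ == "__main__":
    strength = 0.1  # plays the role of beta in DPO and tau in IPO
    for steps in (2000, 4000, 8000):
        h_dpo = descend(dpo_grad, strength, steps=steps)
        h_ipo = descend(ipo_grad, strength, steps=steps)
        print(f"steps={steps}: DPO gap={h_dpo:.2f}, IPO gap={h_ipo:.2f}")
```

In this sketch the DPO gap keeps growing as more steps are taken, because the DPO gradient never reaches zero, so the policy drifts ever further from the reference. The IPO gap settles at `1/(2*tau)`, so the regularization strength directly bounds how far the policy moves.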

A General Theoretical Paradigm to Understand Learning from Human Preferences
