A General Theoretical Paradigm to Understand Learning from Human Preferences

Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos
2023-11-22
Abstract: The prevalent deployment of learning from human preferences through reinforcement learning (RLHF) relies on two important approximations: the first assumes that pairwise preferences can be substituted with pointwise rewards. The second assumes that a reward model trained on these pointwise rewards can generalize from collected data to out-of-distribution data sampled by the policy. Recently, Direct Preference Optimisation (DPO) has been proposed as an approach that bypasses the second approximation and learns a policy directly from collected data without the reward modelling stage. However, this method still heavily relies on the first approximation. In this paper we try to gain a deeper theoretical understanding of these practical algorithms. In particular we derive a new general objective called $\Psi$PO for learning from human preferences that is expressed in terms of pairwise preferences and therefore bypasses both approximations. This new general objective allows us to perform an in-depth analysis of the behavior of RLHF and DPO (as special cases of $\Psi$PO) and to identify their potential pitfalls. We then consider another special case for $\Psi$PO by setting $\Psi$ simply to Identity, for which we can derive an efficient optimisation procedure, prove performance guarantees and demonstrate its empirical superiority to DPO on some illustrative examples.
Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses learning from human preferences. It examines the theoretical and practical weaknesses of the two main existing methods, Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimisation (DPO), and proposes an alternative that avoids them.

### Main Problems Addressed

1. **Insufficient Theoretical Foundation**: Despite the practical success of RLHF and DPO, their theoretical foundations are not fully developed. In particular, the assumptions and approximations they make when handling preference data can cause problems that have not been well understood.
2. **Overfitting Problem**: DPO is prone to overfitting when the preference data is close to deterministic (i.e., one option is almost always preferred), because it relies on the assumption that pairwise preferences can be converted into pointwise rewards.
3. **Lack of Effective Regularization**: When preferences are (nearly) deterministic, the KL regularization term in DPO becomes ineffective, so the final policy may deviate significantly from the reference policy.

### New Method Proposed

The paper introduces a general objective called Ψ-Preference Optimisation (ΨPO). It maximizes a non-linear function of the preference probabilities while a KL regularization term, weighted by a tunable parameter, keeps the policy close to the reference policy. The framework recovers RLHF and DPO as special cases and makes their limitations explicit.

- **ΨPO Objective Function**: The objective applies an arbitrary non-decreasing mapping Ψ to the pairwise preference probabilities. Because it is expressed directly in terms of pairwise preferences, it avoids both approximations that RLHF relies on (DPO already bypasses the second but still depends on the first).
- **Identity-PO (IPO)**: IPO is the special case of ΨPO in which Ψ is the identity mapping. It avoids the overfitting problem and keeps regularization effective by controlling the gap between the policy and the reference policy.
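For reference, the objective described above can be written out as follows. This is a sketch in the notation of the published formulation, which this summary does not spell out: the context distribution $\rho$, the comparison (behaviour) policy $\mu$, the true preference probability $p^*$, and the regularization strength $\tau$ are symbols assumed here.

$$
\max_{\pi}\; \mathbb{E}_{x \sim \rho,\; y \sim \pi(\cdot \mid x),\; y' \sim \mu(\cdot \mid x)}\Big[\Psi\big(p^*(y \succ y' \mid x)\big)\Big] \;-\; \tau\, D_{\mathrm{KL}}\big(\pi \,\|\, \pi_{\mathrm{ref}}\big)
$$

The two special cases discussed in the summary correspond to

$$
\Psi(q) = \log\frac{q}{1-q} \quad \text{(RLHF and DPO, under a Bradley-Terry preference model)}, \qquad \Psi(q) = q \quad \text{(IPO)}.
$$

With the logit choice, $\Psi(q) \to \infty$ as $q \to 1$, so for deterministic preferences the first term can outweigh the KL penalty for any finite $\tau$. With the identity choice, $\Psi(q)$ stays in $[0, 1]$, so the KL term always retains its influence.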
### Theoretical Analysis and Experimental Validation

- **Theoretical Analysis**: The paper shows analytically how RLHF and DPO can fail on deterministic preference data, and proves that IPO resolves these issues.
- **Experimental Validation**: Simple illustrative examples demonstrate that DPO overfits when preferences are deterministic. Comparing IPO with DPO across these settings shows that IPO stays closer to the reference policy.

In summary, the paper proposes the general ΨPO framework and the specific IPO method as a more robust and flexible approach to learning from human preferences, and supports both with theoretical analysis and empirical results.
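The contrast between the two methods can be illustrated numerically. The following is a minimal sketch, not the authors' code. It assumes the commonly cited per-pair losses, which this summary does not state: DPO uses `-log sigmoid(beta * h)` and IPO uses `(h - 1/(2*tau))**2`, where `h` is the log-likelihood-ratio gap between the preferred response `y_w` and the rejected response `y_l`, measured relative to the reference policy. The function and parameter names are hypothetical.

```python
import math


def log_ratio_gap(logp_w, logp_l, ref_logp_w, ref_logp_l):
    """h = log[pi(y_w) * pi_ref(y_l) / (pi(y_l) * pi_ref(y_w))].

    All four arguments are log-probabilities of whole responses.
    """
    return (logp_w - ref_logp_w) - (logp_l - ref_logp_l)


def dpo_loss(h, beta):
    """Assumed DPO per-pair loss: -log sigmoid(beta * h).

    Computed as softplus(-beta * h) in a numerically stable form.
    """
    z = -beta * h
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def ipo_loss(h, tau):
    """Assumed IPO per-pair loss: squared distance of h from 1 / (2 * tau)."""
    return (h - 1.0 / (2.0 * tau)) ** 2


if __name__ == "__main__":
    beta = tau = 0.5  # illustrative values only
    for h in (0.0, 1.0, 5.0, 20.0):
        print(f"h={h:5.1f}  dpo={dpo_loss(h, beta):.6f}  ipo={ipo_loss(h, tau):.6f}")
```

Under these assumptions, the DPO loss keeps decreasing as `h` grows, so a pair that is always labelled the same way pushes the policy ever further from the reference. The IPO loss has a finite minimizer at `h = 1 / (2 * tau)`, which bounds how far the policy is pulled and leaves `tau` in control of the regularization.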