Human Alignment of Large Language Models through Online Preference Optimisation

Daniele Calandriello, Daniel Guo, Remi Munos, Mark Rowland, Yunhao Tang, Bernardo Avila Pires, Pierre Harvey Richemond, Charline Le Lan, Michal Valko, Tianqi Liu, Rishabh Joshi, Zeyu Zheng, Bilal Piot
2024-03-13
Abstract: Ensuring alignment of language models' outputs with human preferences is critical to guarantee a useful, safe, and pleasant user experience. Thus, human alignment has been extensively studied recently and several methods such as Reinforcement Learning from Human Feedback (RLHF), Direct Policy Optimisation (DPO) and Sequence Likelihood Calibration (SLiC) have emerged. In this paper, our contribution is two-fold. First, we show the equivalence between two recent alignment methods, namely Identity Policy Optimisation (IPO) and Nash Mirror Descent (Nash-MD). Second, we introduce a generalisation of IPO, named IPO-MD, that leverages the regularised sampling approach proposed by Nash-MD. This equivalence may seem surprising at first sight, since IPO is an offline method whereas Nash-MD is an online method using a preference model. However, this equivalence can be proven when we consider the online version of IPO, that is when both generations are sampled by the online policy and annotated by a trained preference model. Optimising the IPO loss with such a stream of data becomes then equivalent to finding the Nash equilibrium of the preference model through self-play. Building on this equivalence, we introduce the IPO-MD algorithm that generates data with a mixture policy (between the online and reference policy) similarly as the general Nash-MD algorithm. We compare online-IPO and IPO-MD to different online versions of existing losses on preference data such as DPO and SLiC on a summarisation task.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses how to align the outputs of large language models with human preferences, so that they provide a useful, safe and pleasant user experience. Specifically, the authors show that two recently proposed alignment methods, Identity Policy Optimisation (IPO) and Nash Mirror Descent (Nash-MD), are equivalent, and they build a new preference optimisation algorithm on this theoretical bridge.

### Main Contributions

1. **Proof of Equivalence**: The authors prove that IPO (an offline method) and Nash-MD (an online method) are equivalent under certain conditions. In the online version of IPO, both generations are sampled from the online policy and labelled by a trained preference model; optimising the IPO loss on this stream of data is then equivalent to finding the Nash equilibrium of the preference model through self-play.
2. **Introduction of New Algorithms**: Building on this equivalence, the authors propose the IPO-MD algorithm, which generates data with a mixture policy (a geometric mixture between the online policy and the reference policy), as the Nash-MD algorithm does. They also propose Online IPO, an online variant of IPO.
3. **Experimental Comparison**: The authors compare Online IPO and IPO-MD with online versions of existing preference losses (such as DPO and SLiC) on a summarisation task, and report experimental results and analysis.

### Core Issues

- **Human Alignment**: Ensure that the behaviour of the language model conforms to human preferences, especially in natural language generation tasks.
- **Combination of Online and Offline Methods**: Explore how to combine the advantages of offline methods (such as stability) with the advantages of online methods (such as real-time adaptability) to improve model performance.
- **Application of Nash Equilibrium**: Optimise the model by finding the Nash equilibrium of the preference probabilities, making the model more robust when facing different preferences.

### Formula Representation

The key formulas are listed below. Here $Y^{+}$ and $Y^{-}$ denote the preferred and the less preferred generation of the pair $(Y, Y')$, $\pi_{\text{ref}}$ is the reference policy, $\tau$ is the regularisation parameter, $\beta$ is the mixture parameter, and $\text{SG}$ denotes a stop-gradient.

- **IPO Loss Function**:
  \[ L_{\text{IPO}}=\mathbb{E}_{Y, Y' \sim \mu}\left[\left(\log \frac{\pi(Y^{+})\pi_{\text{ref}}(Y^{-})}{\pi(Y^{-})\pi_{\text{ref}}(Y^{+})}-\frac{\tau^{-1}}{2}\right)^{2}\right] \]
- **Nash-MD-PG Update Rule**:
  \[ \nabla \log \pi(y)\left(p(y \succ y')-\frac{1}{2}-\tau \log \frac{\pi(y)}{\pi_{\text{ref}}(y)}\right) \]
- **IPO-MD Loss Function**:
  \[ L_{\text{IPO-MD}}=\mathbb{E}_{Y, Y' \sim \text{SG}[\pi^{1-\beta}(\pi_{\text{ref}})^{\beta}]}\left[\left(\log \frac{\pi(Y^{+})\pi_{\text{ref}}(Y^{-})}{\pi(Y^{-})\pi_{\text{ref}}(Y^{+})}-\frac{\tau^{-1}}{2}\right)^{2}\right] \]

Together, these formulas trace the authors' move from the offline IPO loss to its online variants, which they support with mathematical derivation and experiments.
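
The IPO loss and the geometric mixture used by IPO-MD can be illustrated with a small sketch. This is not the authors' implementation: it assumes that sequence log-probabilities have already been computed, that the regression target is $\tau^{-1}/2 = 1/(2\tau)$, and that the mixture is formed over a toy finite set of outcomes rather than per token inside a language model. The function names `ipo_loss` and `geometric_mixture` are hypothetical.

```python
def ipo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, tau):
    """Mean squared IPO loss over a batch of preference pairs.

    Each of the first four arguments is a list of sequence log-probabilities:
    the policy and the reference policy evaluated on the preferred (pos)
    and less preferred (neg) generation of each pair.
    """
    target = 1.0 / (2.0 * tau)  # assumed regression target, tau^{-1} / 2
    total = 0.0
    for lp, ln, rp, rn in zip(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg):
        # log [ pi(y+) pi_ref(y-) / (pi(y-) pi_ref(y+)) ]
        log_ratio = (lp - rp) - (ln - rn)
        total += (log_ratio - target) ** 2
    return total / len(logp_pos)


def geometric_mixture(pi, pi_ref, beta):
    """Normalised geometric mixture pi^(1 - beta) * pi_ref^beta.

    pi and pi_ref are probability lists over the same finite set of outcomes
    (a toy stand-in for the sampling distribution used to generate data).
    """
    weights = [p ** (1.0 - beta) * q ** beta for p, q in zip(pi, pi_ref)]
    normaliser = sum(weights)
    return [w / normaliser for w in weights]


if __name__ == "__main__":
    # One pair where the policy equals the reference: log-ratio is 0.
    print(ipo_loss([-1.0], [-2.0], [-1.0], [-2.0], tau=0.5))
    # Halfway between the online policy and the reference policy.
    print(geometric_mixture([0.5, 0.5], [0.9, 0.1], beta=0.5))
```

With `beta=0` the mixture reduces to the online policy, which corresponds to the Online IPO sampling scheme; with `beta=1` it reduces to the reference policy. In an actual training loop the samples drawn from the mixture would be treated as constants (the stop-gradient in the IPO-MD loss), so only the log-ratio term carries gradients.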