Abstract: Ensuring alignment of language models' outputs with human preferences is critical to guarantee a useful, safe, and pleasant user experience. Thus, human alignment has been extensively studied recently and several methods such as Reinforcement Learning from Human Feedback (RLHF), Direct Policy Optimisation (DPO) and Sequence Likelihood Calibration (SLiC) have emerged. In this paper, our contribution is two-fold. First, we show the equivalence between two recent alignment methods, namely Identity Policy Optimisation (IPO) and Nash Mirror Descent (Nash-MD). Second, we introduce a generalisation of IPO, named IPO-MD, that leverages the regularised sampling approach proposed by Nash-MD. This equivalence may seem surprising at first sight, since IPO is an offline method whereas Nash-MD is an online method using a preference model. However, this equivalence can be proven when we consider the online version of IPO, that is when both generations are sampled by the online policy and annotated by a trained preference model. Optimising the IPO loss with such a stream of data becomes then equivalent to finding the Nash equilibrium of the preference model through self-play. Building on this equivalence, we introduce the IPO-MD algorithm that generates data with a mixture policy (between the online and reference policy) similarly as the general Nash-MD algorithm. We compare online-IPO and IPO-MD to different online versions of existing losses on preference data such as DPO and SLiC on a summarisation task.

What problem does this paper attempt to address?

The paper addresses the problem of aligning the outputs of large language models with human preferences, so that they provide a useful, safe and pleasant user experience. Specifically, the authors show the equivalence of two recently proposed alignment methods, Identity Policy Optimisation (IPO) and Nash Mirror Descent (Nash-MD), and propose a new class of preference optimization algorithms based on this theoretical bridge.

### Main Contributions

1. **Proof of Equivalence**: The authors prove that IPO (an offline method) and Nash-MD (an online method) are equivalent once IPO is run online. In the online version of IPO, both generations are sampled from the online policy and labeled by a trained preference model; optimizing the IPO loss on this stream of data is equivalent to finding the Nash equilibrium of the preference model through self-play.
2. **Introduction of New Algorithms**: Building on this equivalence, the authors propose the IPO-MD algorithm, which generates data with a mixture policy (a geometric mixture between the online policy and the reference policy), as the Nash-MD algorithm does. Online IPO, an online variant of IPO, is also proposed.
3. **Experimental Comparison**: The authors compare Online IPO and IPO-MD with online versions of existing losses (such as DPO and SLiC) on a summarisation task, providing detailed experimental results and analysis.

### Core Issues

- **Human Alignment**: Ensure that the behavior of the language model conforms to human preferences, especially in natural language generation tasks.
- **Combination of Online and Offline Methods**: Explore how to combine the advantages of offline methods (such as stability) with the advantages of online methods (such as real-time adaptability) to improve model performance.
- **Application of Nash Equilibrium**: Optimize the model by finding the Nash equilibrium of the preference model, making the model more robust when facing different preferences.

### Formula Representation

The key formulas are listed below. Here \(Y^{+}\) and \(Y^{-}\) denote the preferred and the dispreferred generation of the pair \((Y, Y')\), \(\pi_{\text{ref}}\) is the reference policy, \(\tau\) is the regularisation parameter, and \(\text{SG}\) is the stop-gradient operator.

- **IPO Loss Function**:
\[ L_{\text{IPO}}=\mathbb{E}_{Y, Y' \sim \mu}\left[\left(\log \frac{\pi(Y^{+})\pi_{\text{ref}}(Y^{-})}{\pi(Y^{-})\pi_{\text{ref}}(Y^{+})}-\frac{\tau^{-1}}{2}\right)^{2}\right] \]
- **Nash-MD-PG Update Rule**:
\[ \nabla \log \pi(y)\left(p(y \succ y')-\frac{1}{2}-\tau \log \frac{\pi(y)}{\pi_{\text{ref}}(y)}\right) \]
- **IPO-MD Loss Function**:
\[ L_{\text{IPO-MD}}=\mathbb{E}_{Y, Y' \sim \text{SG}[\pi^{1-\beta}(\pi_{\text{ref}})^{\beta}]}\left[\left(\log \frac{\pi(Y^{+})\pi_{\text{ref}}(Y^{-})}{\pi(Y^{-})\pi_{\text{ref}}(Y^{+})}-\frac{\tau^{-1}}{2}\right)^{2}\right] \]

The IPO and IPO-MD losses share the same squared term and differ only in the sampling distribution: a fixed distribution \(\mu\) for IPO, and the geometric mixture of the current and reference policies for IPO-MD. This is how the authors move from the offline to the online setting and arrive at the new optimization algorithms.
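The IPO and IPO-MD formulas can be illustrated with a small sketch in plain Python. It is not the authors' implementation. It assumes that sequence log-probabilities under the policy and the reference policy are already available as floats, that the regression target is \(\tau^{-1}/2 = 1/(2\tau)\), and that the geometric mixture is taken over a small discrete set of outcomes with strictly positive probabilities. The function names `ipo_loss` and `geometric_mixture` are hypothetical.

```python
import math


def ipo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, tau):
    """Mean squared IPO residual over a batch of preference pairs.

    Each argument is a list of sequence log-probabilities: the policy and
    the reference policy evaluated on the preferred (pos) and dispreferred
    (neg) generation of each pair.
    """
    target = 1.0 / (2.0 * tau)  # assumed target: tau^{-1} / 2
    total = 0.0
    for lp, ln, rp, rn in zip(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg):
        # log [ pi(Y+) pi_ref(Y-) / (pi(Y-) pi_ref(Y+)) ]
        h = (lp - rp) - (ln - rn)
        total += (h - target) ** 2
    return total / len(logp_pos)


def geometric_mixture(pi, pi_ref, beta):
    """Normalised pi^(1 - beta) * pi_ref^beta over a finite outcome set.

    Assumes every probability is strictly positive. In IPO-MD the pairs are
    sampled from this mixture and no gradient flows through the sampling.
    """
    weights = [
        math.exp((1.0 - beta) * math.log(p) + beta * math.log(q))
        for p, q in zip(pi, pi_ref)
    ]
    z = sum(weights)
    return [w / z for w in weights]


if __name__ == "__main__":
    # One pair whose log-ratio equals the target 1 / (2 * 0.5) = 1.
    print(ipo_loss([-1.0], [-2.0], [-1.5], [-1.5], tau=0.5))
    # beta = 0 recovers the online policy, beta = 1 the reference policy.
    print(geometric_mixture([0.8, 0.2], [0.5, 0.5], beta=0.0))
    print(geometric_mixture([0.8, 0.2], [0.5, 0.5], beta=1.0))
```

Under these assumptions, setting `beta=0` samples from the current policy alone, which corresponds to online IPO, while intermediate values of `beta` pull the sampling distribution towards the reference policy as in IPO-MD.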

Human Alignment of Large Language Models through Online Preference Optimisation
