Generalized Preference Optimization: A Unified Approach to Offline Alignment

Yunhao Tang, Zhaohan Daniel Guo, Zeyu Zheng, Daniele Calandriello, Rémi Munos, Mark Rowland, Pierre Harvey Richemond, Michal Valko, Bernardo Ávila Pires, Bilal Piot
2024-05-29
Abstract: Offline preference optimization allows fine-tuning large models directly from offline data, and has proved effective in recent alignment practices. We propose generalized preference optimization (GPO), a family of offline losses parameterized by a general class of convex functions. GPO enables a unified view over preference optimization, encompassing existing algorithms such as DPO, IPO and SLiC as special cases, while naturally introducing new variants. The GPO framework also sheds light on how offline algorithms enforce regularization, through the design of the convex function that defines the loss. Our analysis and experiments reveal the connections and subtle differences between the offline regularization and the KL divergence regularization intended by the canonical RLHF formulation. In a controlled setting akin to Gao et al. (2023), we also show that different GPO variants achieve similar trade-offs between regularization and performance, though the optimal hyper-parameter values might differ, as predicted by theory. In all, our results present new algorithmic toolkits and empirical insights to alignment practitioners.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses offline preference optimization: fine-tuning large models directly on datasets of pairwise comparisons, without collecting new data interactively. The approach plays a key role in aligning AI systems, particularly in the context of Reinforcement Learning from Human Feedback (RLHF). Existing algorithms such as DPO, IPO, and SLiC each do this with a different loss function. The paper proposes Generalized Preference Optimization (GPO), a family of offline losses parameterized by a general class of convex functions, which recovers these existing algorithms as special cases and suggests new variants.

The GPO framework shows how each offline algorithm enforces regularization through its choice of convex function, and it analyzes how this offline regularization differs from the KL divergence regularization of the canonical RLHF formulation. In experiments in a controlled setting, different GPO variants achieve similar trade-offs between regularization and performance, although the optimal hyperparameter values may differ from one variant to another.

The paper also treats reward modeling as a supervised binary classification problem and draws on the theory of supervised learning to unify offline alignment algorithms. It discusses how the choice of loss function affects regularization strength and how to choose suitable hyperparameters. Together, these results give alignment practitioners new algorithmic tools and empirical insights.
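To make the "one loss family, many convex functions" idea concrete, the sketch below computes a loss of the form mean of f(beta * rho) over preference pairs, where rho is the difference in policy-versus-reference log-ratios between the preferred and the rejected response. This is a minimal illustration and not the authors' implementation. The function names, the argument layout, and the pairing of each convex function with an algorithm (logistic for DPO, hinge for SLiC, squared for IPO) are assumptions based on how these correspondences are commonly stated; they should be checked against the paper before use.

```python
import math
from typing import Callable, Sequence


def logistic(x: float) -> float:
    """log(1 + exp(-x)), computed stably. Assumed here to correspond to DPO."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def hinge(x: float) -> float:
    """max(0, 1 - x). Assumed here to correspond to SLiC."""
    return max(0.0, 1.0 - x)


def squared(x: float) -> float:
    """(x - 1)^2. Assumed here to correspond to IPO."""
    return (x - 1.0) ** 2


def gpo_loss(
    f: Callable[[float], float],
    beta: float,
    logp_w: Sequence[float],      # policy log-prob of preferred responses
    logp_l: Sequence[float],      # policy log-prob of rejected responses
    ref_logp_w: Sequence[float],  # reference log-prob of preferred responses
    ref_logp_l: Sequence[float],  # reference log-prob of rejected responses
) -> float:
    """Average f(beta * rho) over a batch of preference pairs (hypothetical helper)."""
    total = 0.0
    for pw, pl, rw, rl in zip(logp_w, logp_l, ref_logp_w, ref_logp_l):
        # rho: how much more the policy favors the preferred response
        # than the reference model does, in log-ratio terms.
        rho = (pw - rw) - (pl - rl)
        total += f(beta * rho)
    return total / len(logp_w)


if __name__ == "__main__":
    # Toy batch with made-up log-probabilities, for illustration only.
    batch = dict(
        logp_w=[-1.0, -2.0],
        logp_l=[-3.0, -2.5],
        ref_logp_w=[-1.5, -2.0],
        ref_logp_l=[-2.5, -2.0],
    )
    for name, f in [("logistic", logistic), ("hinge", hinge), ("squared", squared)]:
        print(name, gpo_loss(f, beta=0.1, **batch))
```

Swapping `f` is the only change needed to move between variants in this sketch, which mirrors the summary's point that the convex function determines how the loss regularizes. For instance, the hinge and squared functions stop rewarding (or start penalizing) values of beta * rho above 1, while the logistic function keeps decreasing, so the same `beta` need not give the same effective regularization across variants.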