Abstract: Offline preference optimization allows fine-tuning large models directly from offline data, and has proved effective in recent alignment practices. We propose generalized preference optimization (GPO), a family of offline losses parameterized by a general class of convex functions. GPO enables a unified view over preference optimization, encompassing existing algorithms such as DPO, IPO and SLiC as special cases, while naturally introducing new variants. The GPO framework also sheds light on how offline algorithms enforce regularization, through the design of the convex function that defines the loss. Our analysis and experiments reveal the connections and subtle differences between the offline regularization and the KL divergence regularization intended by the canonical RLHF formulation. In a controlled setting akin to Gao et al. (2023), we also show that different GPO variants achieve similar trade-offs between regularization and performance, though the optimal hyperparameter values might differ, as predicted by theory. In all, our results present new algorithmic toolkits and empirical insights to alignment practitioners.

What problem does this paper attempt to address?

This paper addresses offline preference optimization: fine-tuning large models directly from pairwise comparison datasets, without interactive data collection. The approach plays a key role in aligning AI systems, particularly in Reinforcement Learning from Human Feedback (RLHF). Existing offline algorithms such as DPO, IPO, and SLiC each use a different loss function. The paper proposes Generalized Preference Optimization (GPO), a family of offline losses parameterized by a general class of convex functions, which recovers the existing algorithms as special cases and introduces new variants. The framework rests on treating reward modeling as a supervised binary classification problem, so that the theory of supervised learning can be used to unify offline alignment algorithms. GPO shows how each offline algorithm enforces regularization through its choice of convex function, and the paper analyzes how this offline regularization differs from the KL divergence regularization intended by the canonical RLHF formulation. It also discusses how the choice of loss function affects regularization strength and how to choose suitable hyperparameters. Experiments in a controlled setting show that different GPO variants achieve similar trade-offs between regularization and performance, although the optimal hyperparameter values may differ between variants. Together, these results give alignment practitioners new algorithmic toolkits and empirical insights.
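The idea of a loss family indexed by a convex function can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the summary above does not give the concrete functions, so the logistic, squared, and hinge forms used here for DPO, IPO, and SLiC (each up to a rescaling of `beta`) are assumptions, as are the names `gpo_loss`, `CONVEX_FNS`, and the batch layout of four log-probabilities per preference pair.

```python
import math


def logistic(x):
    # Assumed DPO-style choice: log(1 + exp(-x)), computed in a stable way.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def squared(x):
    # Assumed IPO-style choice: squared loss centred at 1.
    return (x - 1.0) ** 2


def hinge(x):
    # Assumed SLiC-style choice: hinge loss with unit margin.
    return max(0.0, 1.0 - x)


# Hypothetical registry mapping algorithm names to convex functions.
CONVEX_FNS = {"dpo": logistic, "ipo": squared, "slic": hinge}


def gpo_loss(batch, beta, f):
    """Average f(beta * rho) over a batch of preference pairs.

    Each batch item is (logp_w, logp_l, ref_logp_w, ref_logp_l): log-probs
    of the preferred (w) and dispreferred (l) responses under the policy
    being trained and under the fixed reference policy.
    """
    total = 0.0
    for logp_w, logp_l, ref_logp_w, ref_logp_l in batch:
        # Difference of log-ratios between preferred and dispreferred responses.
        rho = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
        total += f(beta * rho)
    return total / len(batch)


# Toy batch: one pair where the policy equals the reference (rho = 0),
# one where the policy already favours the preferred response.
batch = [(-1.0, -1.0, -1.0, -1.0), (-0.5, -2.0, -1.0, -1.0)]
losses = {name: gpo_loss(batch, beta=1.0, f=f) for name, f in CONVEX_FNS.items()}
```

Swapping `f` is the only change needed to move between variants in this sketch, which is the sense in which the convex function defines the loss. The same `beta` generally produces different regularization strength under different choices of `f`, so it would be tuned separately for each variant.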

Generalized Preference Optimization: A Unified Approach to Offline Alignment

Beyond Reward: Offline Preference-guided Policy Optimization

Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization

$α$-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs

The Importance of Online Data: Understanding Preference Fine-tuning via Coverage

Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model

Discovering Preference Optimization Algorithms with and for Large Language Models

Online DPO: Online Direct Preference Optimization with Fast-Slow Chasing

Group Preference Optimization: Few-Shot Alignment of Large Language Models

Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts

Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization

The Hitchhiker's Guide to Human Alignment with *PO

Direct Preference Optimization Using Sparse Feature-Level Constraints

Self-Improving Robust Preference Optimization

New Desiderata for Direct Preference Optimization

$f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization

Ordinal Preference Optimization: Aligning Human Preferences via NDCG

Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint

Uncertainty-Penalized Direct Preference Optimization

Orthogonal Finetuning for Direct Preference Optimization

OPTune: Efficient Online Preference Tuning