Abstract: Learning from preference-based feedback has recently gained traction as a promising approach to align language models with human interests. While these aligned generative models have demonstrated impressive capabilities across various tasks, their dependence on high-quality human preference data poses a bottleneck in practical applications. Specifically, noisy (incorrect and ambiguous) preference pairs in the dataset might restrict the language models from capturing human intent accurately. While practitioners have recently proposed heuristics to mitigate the effect of noisy preferences, a complete theoretical understanding of their workings remains elusive. In this work, we aim to bridge this gap by introducing a general framework for policy optimization in the presence of random preference flips. We focus on the direct preference optimization (DPO) algorithm in particular since it assumes that preferences adhere to the Bradley-Terry-Luce (BTL) model, raising concerns about the impact of noisy data on the learned policy. We design a novel loss function, which de-biases the effect of noise on average, making a policy trained by minimizing that loss robust to the noise. Under log-linear parameterization of the policy class and assuming good feature coverage of the SFT policy, we prove that the sub-optimality gap of the proposed robust DPO (rDPO) policy compared to the optimal policy is of the order $O(\frac{1}{1-2\epsilon}\sqrt{\frac{d}{n}})$, where $\epsilon < 1/2$ is the flip rate of labels, $d$ is the policy parameter dimension and $n$ is the size of the dataset. Our experiments on IMDb sentiment generation and Anthropic's helpful-harmless dataset show that rDPO is robust to noise in preference labels compared to vanilla DPO and other heuristics proposed by practitioners.
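
The idea of a loss that de-biases random preference flips "on average" can be illustrated with a small sketch. The abstract does not spell out the loss, so the form below is an assumption: it uses the standard unbiased-estimator construction for learning with noisy labels, applied to a logistic (DPO-style) loss. The names `dpo_loss`, `debiased_loss`, `expected_noisy_loss`, `margin`, and `beta` are hypothetical and chosen for illustration only; `margin` stands in for the implicit reward gap between the response labeled as preferred and the one labeled as rejected.

```python
import math


def dpo_loss(margin, beta=0.1):
    """Logistic loss -log(sigmoid(beta * margin)) for one preference pair."""
    return math.log1p(math.exp(-beta * margin))


def debiased_loss(margin, eps, beta=0.1):
    """Noise-corrected loss for an observed, possibly flipped, pair.

    Assumed form (not taken from the abstract): a weighted difference of
    the loss on the observed ordering and the loss on the swapped ordering,
    rescaled by 1 / (1 - 2 * eps).
    """
    if not 0.0 <= eps < 0.5:
        raise ValueError("flip rate must satisfy 0 <= eps < 1/2")
    kept = dpo_loss(margin, beta)      # loss if the observed label is trusted
    swapped = dpo_loss(-margin, beta)  # loss if the observed label is reversed
    return ((1.0 - eps) * kept - eps * swapped) / (1.0 - 2.0 * eps)


def expected_noisy_loss(clean_margin, eps, beta=0.1):
    """Average the corrected loss over the flip noise.

    The clean label is observed with probability 1 - eps and flipped
    (margin negated) with probability eps.
    """
    return ((1.0 - eps) * debiased_loss(clean_margin, eps, beta)
            + eps * debiased_loss(-clean_margin, eps, beta))


if __name__ == "__main__":
    clean = dpo_loss(2.0)
    averaged = expected_noisy_loss(2.0, eps=0.3)
    print(f"clean loss: {clean:.6f}, averaged corrected loss: {averaged:.6f}")
```

Under this assumed form, averaging the corrected loss over the flip noise recovers the clean loss exactly, which is one way to read "de-biases the effect of noise on average". The rescaling by $\frac{1}{1-2\epsilon}$ also grows as $\epsilon$ approaches $1/2$, in line with the same factor appearing in the stated sub-optimality bound.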

Provably Robust DPO: Aligning Language Models with Noisy Feedback

Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization

ROPO: Robust Preference Optimization for Large Language Models

Aligning Large Language Models with Counterfactual DPO

Uncertainty-Penalized Direct Preference Optimization

Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization

Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model

On the Generalization of Preference Learning with DPO

Impact of Preference Noise on the Alignment Performance of Generative Language Models

Simultaneous Reward Distillation and Preference Learning: Get You a Language Model Who Can Do Both

Robust Preference Optimization through Reward Model Distillation

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization

Negating Negatives: Alignment with Human Negative Samples via Distributional Dispreference Optimization

Direct Preference Optimization with an Offset

Active Preference Learning for Large Language Models

Hybrid Preference Optimization: Augmenting Direct Preference Optimization with Auxiliary Objectives

Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level

Understanding Likelihood Over-optimisation in Direct Alignment Algorithms