Abstract: Aligning Large Language Models (LLMs) to human preferences in content, style, and presentation is challenging, in part because preferences are varied, context-dependent, and sometimes inherently ambiguous. While successful, Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are prone to the issue of proxy reward overoptimization. Analysis of the DPO loss reveals a critical need for regularization for mislabeled or ambiguous preference pairs to avoid reward hacking. In this work, we develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes, inspired by offline reinforcement learning. The penalization serves as a correction to the loss which attenuates the loss gradient for uncertain samples. Evaluation of the methods is performed with GPT2 Medium on the Anthropic-HH dataset using a model ensemble to obtain uncertainty estimates, and shows improved overall performance compared to vanilla DPO, as well as better completions on prompts from high-uncertainty chosen/rejected responses.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses the challenge of aligning large language models (LLMs) with human preferences in terms of content, style, and presentation. Specifically, it focuses on over-optimization in Direct Preference Optimization (DPO). DPO fine-tunes a language model by maximizing the likelihood of preference data under the Bradley-Terry model, but it is prone to overfitting the preference dataset, especially on mislabeled or ambiguous preference pairs.

### Main Contributions

1. **Detailed Analysis of DPO**: The paper analyzes DPO and its overfitting issues, revealing DPO's sensitivity to mislabeled samples.
2. **Introduction of a Pessimistic Framework**: The paper proposes a pessimistic framework that addresses DPO's overfitting problem by introducing an uncertainty penalty. This framework draws on pessimism techniques from offline reinforcement learning.
3. **Performance Validation**: The paper reports that the proposed method outperforms the unpenalized objective across various tasks and robustness experiments.

### Method Overview

- **Standard Uncertainty Penalty**: The paper first applies the standard Lower Confidence Bound (LCB) penalty to DPO, obtaining conservative reward estimates by subtracting the uncertainty from the reward scores.
- **Main Method: Energy Factor Penalty**: The paper proposes a multiplicative penalty scheme that multiplies the preference values or rewards by an uncertainty energy function. The strength of this penalty can be adjusted via a temperature parameter.

### Experimental Results

- **Performance Evaluation**: The penalized DPO model performs comparably to or better than standard DPO across all penalty intensities, with the best performance observed at 30% penalty intensity.
- **Robustness Evaluation**: Evaluation on high-uncertainty samples shows that the multiplicative penalty scheme handles high-uncertainty chosen/rejected pairs best.

### Conclusion

The paper proposes a new DPO framework that addresses DPO's overfitting problem by introducing pessimism and leveraging preference uncertainty estimates. Experimental results indicate that the energy factor penalty scheme performs best in terms of overall performance and robustness. Future work will evaluate more powerful models on various tasks and extend the approach to other methods such as ΨPO and IPO.
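
The two penalty schemes in the method overview can be illustrated with a small per-pair sketch. This is not the paper's implementation: the summary does not give the exact formulas, so the way the uncertainty enters each margin, the names `lcb_margin`, `energy_margin`, and `ensemble_uncertainty`, the weight `lam`, and the `exp(-u / temperature)` form of the energy factor are all assumptions made for illustration. Only the vanilla DPO loss, `-log sigmoid` of the implicit reward margin, is standard.

```python
import math
from statistics import pstdev


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def neg_log_sigmoid(z: float) -> float:
    """-log(sigmoid(z)), computed stably as softplus(-z)."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def implicit_reward(logp: float, ref_logp: float, beta: float = 0.1) -> float:
    """DPO implicit reward beta * log(pi_theta / pi_ref), up to a prompt-only constant."""
    return beta * (logp - ref_logp)


def lcb_margin(r_chosen: float, r_rejected: float,
               u_chosen: float, u_rejected: float, lam: float = 1.0) -> float:
    """Assumed additive (LCB-style) scheme: lower each reward by lam * its uncertainty."""
    return (r_chosen - lam * u_chosen) - (r_rejected - lam * u_rejected)


def energy_margin(margin: float, u_pair: float, temperature: float = 1.0) -> float:
    """Assumed multiplicative scheme: scale the margin by exp(-u / T), a factor in (0, 1]."""
    return math.exp(-u_pair / temperature) * margin


def grad_wrt_margin(margin: float, scale: float = 1.0) -> float:
    """Derivative of -log sigmoid(scale * margin) with respect to margin."""
    return -scale * sigmoid(-scale * margin)


def ensemble_uncertainty(margins: list[float]) -> float:
    """Hypothetical uncertainty estimate: std of the margin across ensemble members."""
    return pstdev(margins)


if __name__ == "__main__":
    # Made-up sequence log-probabilities for one chosen/rejected pair.
    r_w = implicit_reward(logp=-42.0, ref_logp=-40.0)
    r_l = implicit_reward(logp=-38.0, ref_logp=-41.0)
    margin = r_w - r_l  # negative: the policy currently ranks the pair the "wrong" way

    # Made-up margins from four ensemble members that disagree on the label.
    u = ensemble_uncertainty([0.9, -0.6, 0.2, -1.1])
    w = math.exp(-u / 1.0)

    print("vanilla loss:", neg_log_sigmoid(margin))
    print("LCB loss:    ", neg_log_sigmoid(lcb_margin(r_w, r_l, u, 0.0)))
    print("energy loss: ", neg_log_sigmoid(energy_margin(margin, u)))
    print("vanilla grad:", grad_wrt_margin(margin))
    print("energy grad: ", grad_wrt_margin(margin, scale=w))
```

Under this assumed multiplicative form, the gradient magnitude with respect to the margin is bounded above by the energy factor, so a pair with high ensemble disagreement can contribute only a small update. For a misranked pair (negative margin, as in the example) the scaled gradient is smaller than the vanilla one; this need not hold for pairs that are already well separated, where scaling the margin down moves the sigmoid away from saturation.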

Uncertainty-Penalized Direct Preference Optimization

$α$-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs

Direct Preference Optimization With Unobserved Preference Heterogeneity

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model

New Desiderata for Direct Preference Optimization

Entropy Controllable Direct Preference Optimization

Provably Robust DPO: Aligning Language Models with Noisy Feedback

Direct Preference Optimization with an Offset

Active Preference Learning for Large Language Models

Enhancing LLM Safety via Constrained Direct Preference Optimization

Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer

Hybrid Preference Optimization: Augmenting Direct Preference Optimization with Auxiliary Objectives

Robust Preference Optimization through Reward Model Distillation

RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models

Minor DPO reject penalty to increase training robustness

Understanding Likelihood Over-optimisation in Direct Alignment Algorithms

On the Generalization of Preference Learning with DPO

A General Theoretical Paradigm to Understand Learning from Human Preferences