Uncertainty-Penalized Direct Preference Optimization

Sam Houliston, Alizée Pace, Alexander Immer, Gunnar Rätsch
2024-10-26
Abstract: Aligning Large Language Models (LLMs) to human preferences in content, style, and presentation is challenging, in part because preferences are varied, context-dependent, and sometimes inherently ambiguous. While successful, Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are prone to the issue of proxy reward overoptimization. Analysis of the DPO loss reveals a critical need for regularization for mislabeled or ambiguous preference pairs to avoid reward hacking. In this work, we develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes, inspired by offline reinforcement learning. The penalization serves as a correction to the loss which attenuates the loss gradient for uncertain samples. Evaluation of the methods is performed with GPT2 Medium on the Anthropic-HH dataset using a model ensemble to obtain uncertainty estimates, and shows improved overall performance compared to vanilla DPO, as well as better completions on prompts from high-uncertainty chosen/rejected responses.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses the challenge of aligning large language models (LLMs) with human preferences in terms of content, style, and presentation. Specifically, it focuses on over-optimization in Direct Preference Optimization (DPO). DPO fine-tunes language models by maximizing the likelihood of preference data under the Bradley-Terry model, but it is prone to overfitting on the preference dataset, especially for mislabeled or ambiguous preference pairs.

### Main Contributions

1. **Detailed Analysis of DPO**: The paper provides a detailed analysis of DPO and its overfitting issues, revealing DPO's sensitivity to mislabeled samples.
2. **Introduction of a Pessimistic Framework**: The paper proposes a pessimistic framework that addresses DPO's overfitting problem by introducing an uncertainty penalty. This framework draws on pessimism techniques from offline reinforcement learning.
3. **Performance Validation**: The paper demonstrates that the proposed method outperforms the unpenalized objective function across various tasks and robustness experiments.

### Method Overview

- **Standard Uncertainty Penalty**: The paper first applies the standard Lower Confidence Bound (LCB) penalty to DPO, obtaining conservative reward estimates by subtracting uncertainty from the reward scores.
- **Main Method: Energy Factor Penalty**: The paper proposes a multiplicative penalty scheme that multiplies the preference values or rewards by an uncertainty energy function. The strength of this penalty can be adjusted via a temperature parameter.

### Experimental Results

- **Performance Evaluation**: The penalized DPO model performs comparably to or better than standard DPO across all penalty intensities, with the best performance observed at 30% penalty intensity.
- **Robustness Evaluation**: Evaluation on high-uncertainty samples shows that the multiplicative penalty scheme excels at handling high-uncertainty chosen/rejected pairs.

### Conclusion

The paper proposes a new DPO framework that addresses DPO's overfitting problem by introducing pessimism and leveraging preference uncertainty estimates. Experimental results indicate that the energy factor penalty scheme performs best in terms of overall performance and robustness. Future work will evaluate the approach with more powerful models on various tasks and extend it to other methods such as ΨPO and IPO.
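
To make the two penalty schemes described in the method overview more concrete, here is a minimal per-sample sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation. The function names, the default `beta` and `tau` values, the use of the ensemble standard deviation as the uncertainty estimate, and the `exp(-u / tau)` form of the energy factor are all assumptions made for this example. The summary only states that uncertainty is subtracted from the rewards (LCB) or applied as a multiplicative, temperature-controlled factor.

```python
import math
from statistics import pstdev


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """DPO implicit reward: beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def dpo_loss(r_chosen: float, r_rejected: float) -> float:
    """Vanilla DPO: negative Bradley-Terry log-likelihood of one pair."""
    return -log_sigmoid(r_chosen - r_rejected)


def lcb_dpo_loss(r_chosen: float, r_rejected: float,
                 u_chosen: float, u_rejected: float) -> float:
    """Additive (LCB-style) scheme: subtract each response's uncertainty
    from its reward before forming the preference margin."""
    margin = (r_chosen - u_chosen) - (r_rejected - u_rejected)
    return -log_sigmoid(margin)


def energy_dpo_loss(r_chosen: float, r_rejected: float,
                    u_margin: float, tau: float = 1.0) -> float:
    """Multiplicative scheme (assumed form): scale the margin by the
    energy factor exp(-u / tau); tau is the temperature."""
    energy = math.exp(-u_margin / tau)
    return -log_sigmoid(energy * (r_chosen - r_rejected))


def ensemble_uncertainty(values: list[float]) -> float:
    """Assumed uncertainty estimate: std. dev. across ensemble members."""
    return pstdev(values)


if __name__ == "__main__":
    # Hypothetical implicit-reward margins for one pair from a 3-model ensemble.
    margins = [0.9, 0.1, 0.5]
    u = ensemble_uncertainty(margins)
    print("vanilla:", dpo_loss(0.5, 0.0))
    print("energy-penalized:", energy_dpo_loss(0.5, 0.0, u, tau=0.5))
```

In this sketch a zero uncertainty recovers the vanilla DPO loss in both schemes, while a large uncertainty in the multiplicative variant drives the effective margin toward zero, so the sample contributes a smaller gradient. Lower values of `tau` make the penalty sharper.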