3D-Properties: Identifying Challenges in DPO and Charting a Path Forward

Yuzi Yan, Yibo Miao, Jialian Li, Yipin Zhang, Jian Xie, Zhijie Deng, Dong Yan

2024-06-11

Abstract: Aligning large language models (LLMs) with human preference has recently gained tremendous attention, with the canonical yet costly RLHF-PPO and the simple and straightforward Direct Preference Optimization (DPO) as two examples. Despite its efficiency, DPO has rarely been used in state-of-the-art production-level LLMs, which suggests it may have pathologies. In this work, we revisit DPO with a comprehensive examination of its empirical efficacy and a systematic comparison with RLHF-PPO. We identify the 3D-properties of DPO's learning outcomes: the Drastic drop in the likelihood of rejected responses, the Degradation into LLM unlearning, and the Dispersion effect on unseen responses. We observe these properties in experiments with both a carefully designed toy model and practical LLMs, on tasks including mathematical problem-solving and instruction following. These findings connect to observations made in related work, and we additionally contribute a plausible theoretical explanation for them. Accordingly, we propose simple regularization methods to mitigate the issues caused by the 3D-properties, improving the training stability and final performance of DPO. Our contributions also include an investigation into how the distribution of the paired preference data affects the effectiveness of DPO. We hope this work can offer research directions for narrowing the gap between reward-free preference learning methods and reward-based ones.

Artificial Intelligence, Computation and Language, Machine Learning

What problem does this paper attempt to address?

This paper examines the problems of Direct Preference Optimization (DPO) when training Large Language Models (LLMs). It identifies three characteristic properties of DPO's learning outcomes, the "3D-properties": a drastic drop in the likelihood of rejected responses, a degradation into LLM unlearning, and a dispersion effect on unseen responses. These properties hurt the training stability and final performance of DPO. The authors compare DPO with the reward-based RLHF-PPO approach through experiments on a toy model and on practical LLMs, and investigate how the distribution of the paired preference data affects DPO's effectiveness. The paper also proposes several regularization methods that mitigate the problems caused by the 3D-properties and improve DPO's training stability and performance. Finally, the authors hope the work points to research directions for narrowing the gap between reward-free preference learning methods and reward-based ones.
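As a rough illustration of why the likelihood of rejected responses can fall so sharply, the sketch below evaluates the standard DPO loss from the original DPO formulation on scalar log-probabilities. It is not the paper's code: the function name `dpo_loss`, the value `beta=0.1`, and all log-probability values are hypothetical, chosen only to show that the loss depends on the margin between the chosen and rejected log-ratios and not on either likelihood by itself.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (hypothetical helper).

    loss = -log(sigmoid(beta * margin)), where margin is the policy-vs-reference
    log-ratio of the chosen response minus that of the rejected response.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Illustrative (made-up) sequence log-probabilities under the reference model.
ref_chosen, ref_rejected = -10.0, -10.0

# At initialization the policy equals the reference, so the margin is zero.
loss_init = dpo_loss(-10.0, -10.0, ref_chosen, ref_rejected)

# Both likelihoods fall, but the rejected one falls much further.
loss_after = dpo_loss(-12.0, -30.0, ref_chosen, ref_rejected)

print(loss_init, loss_after)
```

In this toy setting the loss goes down even though the chosen response has also become less likely, because only the gap between the two log-ratios enters the objective. This is consistent with the drop in rejected-response likelihood described above, though the paper's own analysis and experiments should be consulted for the actual mechanism.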

Towards Analyzing and Understanding the Limitations of DPO: A Theoretical Perspective

On the Generalization of Preference Learning with DPO

2D-DPO: Scaling Direct Preference Optimization with 2-Dimensional Supervision

Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback

Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

Optimizing LLMs with Direct Preferences: A Data Efficiency Perspective

Enhancing LLM Safety via Constrained Direct Preference Optimization

New Desiderata for Direct Preference Optimization

MallowsPO: Fine-Tune Your LLM with Preference Dispersions

Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

Aligning CodeLLMs with Direct Preference Optimization

sDPO: Don't Use Your Data All at Once

Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level

Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model

Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization

Uncertainty-Penalized Direct Preference Optimization