Towards Analyzing and Understanding the Limitations of DPO: A Theoretical Perspective

Duanyu Feng, Bowen Qin, Chen Huang, Zheng Zhang, Wenqiang Lei
2024-04-06
Abstract: Direct Preference Optimization (DPO), which derives reward signals directly from pairwise preference data, has shown its effectiveness in aligning Large Language Models (LLMs) with human preferences. Despite its widespread use across various tasks, DPO has been criticized for its sensitivity to the effectiveness of SFT and for hindering the model's capacity to learn human-preferred responses, leading to less satisfactory performance. To overcome these limitations, a theoretical understanding of DPO is indispensable but still lacking. To this end, we take a step towards theoretically analyzing and understanding the limitations of DPO. Specifically, we provide an analytical framework using field theory to analyze the optimization process of DPO. By analyzing the gradient vector field of the DPO loss function, we find that the DPO loss function decreases the probability of producing human-dispreferred data at a faster rate than it increases the probability of producing preferred data. This provides theoretical insights for understanding the limitations of DPO discovered in related experimental research, thereby laying the foundation for its improvement.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper analyzes the limitations of Direct Preference Optimization (DPO) as a method for aligning large language models (LLMs) with human preferences. DPO derives reward signals directly from pairwise preference data, but it has been criticized for its sensitivity to the effectiveness of supervised fine-tuning (SFT) and for hindering the model's capacity to learn human-preferred responses. Using field theory as an analytical framework, the study examines the gradient vector field of the DPO loss function and finds that the loss decreases the probability of generating dispreferred data faster than it increases the probability of generating preferred data. This offers a theoretical explanation for why DPO limits how well LLMs learn to generate human-preferred responses and why it is sensitive to the effectiveness of SFT. The paper aims to provide a theoretical foundation for improving DPO.
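The asymmetry described above can be illustrated numerically. The sketch below is not taken from the paper. It assumes the standard per-pair DPO loss, -log sigmoid(beta * (log x1 - log x2)), where x1 and x2 are the policy-to-reference probability ratios for the preferred and dispreferred responses. The names x1, x2, dpo_loss, dpo_grad, and grad_ratio, and the default beta = 0.1, are hypothetical choices made for illustration. Under that assumption the loss equals -log(x1^beta / (x1^beta + x2^beta)), and the ratio of the two partial-derivative magnitudes works out to x1 / x2.

```python
import math


def dpo_loss(x1: float, x2: float, beta: float = 0.1) -> float:
    """Per-pair DPO loss written in terms of probability ratios.

    x1: pi(y_w | x) / pi_ref(y_w | x), ratio for the preferred response.
    x2: pi(y_l | x) / pi_ref(y_l | x), ratio for the dispreferred response.
    Equivalent to -log(sigmoid(beta * (log x1 - log x2))).
    """
    return -math.log(x1**beta / (x1**beta + x2**beta))


def dpo_grad(x1: float, x2: float, beta: float = 0.1) -> tuple[float, float]:
    """Analytic partial derivatives of dpo_loss with respect to x1 and x2."""
    denom = x1**beta + x2**beta
    d_x1 = -beta * x2**beta / (x1 * denom)    # negative: descent raises x1
    d_x2 = beta * x2 ** (beta - 1) / denom    # positive: descent lowers x2
    return d_x1, d_x2


def grad_ratio(x1: float, x2: float, beta: float = 0.1) -> float:
    """|dL/dx2| / |dL/dx1|, which simplifies algebraically to x1 / x2."""
    d_x1, d_x2 = dpo_grad(x1, x2, beta)
    return abs(d_x2) / abs(d_x1)


if __name__ == "__main__":
    # As x1 grows and x2 shrinks, the pull on x2 dominates the pull on x1.
    for x1, x2 in [(1.0, 1.0), (1.2, 0.8), (2.0, 0.5), (3.0, 0.1)]:
        print(f"x1={x1}, x2={x2}, |dL/dx2| / |dL/dx1| = {grad_ratio(x1, x2):.3f}")
```

These derivatives are taken with respect to the ratios x1 and x2 rather than the model parameters, so the sketch shows only the shape of the loss surface, not a full training dynamic. Whenever x2 < x1, the gradient component acting on the dispreferred ratio is larger in magnitude, which is consistent with the paper's finding that the dispreferred probability falls faster than the preferred probability rises.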