Direct Preference Optimization with an Offset

Afra Amini, Tim Vieira, Ryan Cotterell
2024-06-06
Abstract: Direct preference optimization (DPO) is a successful fine-tuning strategy for aligning large language models with human preferences without the need to train a reward model or employ reinforcement learning. DPO, as originally formulated, relies on binary preference data and fine-tunes a language model to increase the likelihood of a preferred response over a dispreferred response. However, not all preference pairs are equal. Sometimes, the preferred response is only slightly better than the dispreferred one. In other cases, the preference is much stronger. For instance, if a response contains harmful or toxic content, the annotator will have a strong preference for that response. In this paper, we propose a generalization of DPO, termed DPO with an offset (ODPO), that does not treat every preference pair equally during fine-tuning. Intuitively, ODPO requires the difference between the likelihood of the preferred and dispreferred response to be greater than an offset value. The offset is determined based on the extent to which one response is preferred over another. Our experiments on various tasks suggest that ODPO significantly outperforms DPO in aligning language models, especially when the number of preference pairs is limited.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses how to align large language models with human preferences more effectively, especially when preference-pair data is limited. Specifically, it proposes a new method, Offset-augmented Direct Preference Optimization (ODPO), to improve on the existing Direct Preference Optimization (DPO) method.

DPO fine-tunes the language model on binary preference data, making the model more likely to generate the responses humans prefer. However, DPO ignores how strong each preference is: in some pairs the preferred response is far better than the dispreferred one, while in others the two are nearly equal. For example, if one response contains harmful or toxic content, annotators will strongly prefer the alternative.

The paper therefore adds an offset value to DPO, so that during fine-tuning the model adjusts response likelihoods according to the actual reward difference within each preference pair. For pairs with a stronger preference, the model must separate the preferred response from the dispreferred one by a larger margin.

### Formula Analysis

1. **DPO Loss Function**:

   \[ L_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)\sim D_{\text{HF}}}\left[\log\sigma\left(\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)\right)\right] \]

   where \(\hat{r}_\theta(x, y) = \beta\log\frac{\pi_\theta(y \mid x)}{\pi_{\text{SFT}}(y \mid x)}\) is the estimated reward.

2. **ODPO Loss Function**:

   \[ L_{\text{ODPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)\sim D_{\text{HF}}}\left[\log\sigma\left(\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l) - \Delta_r\right)\right] \]

   where \(\Delta_r\) is the offset value determined according to the actual reward difference. Setting \(\Delta_r = 0\) recovers the DPO loss.

### Experimental Results

The paper verifies the effectiveness of ODPO through multiple experiments:

1. **Sentiment Control Task**:
   - Uses the IMDB dataset to train the GPT2-Large model to generate movie reviews with positive sentiment.
   - The results show that in all experimental settings, ODPO generates more positive-sentiment samples without deviating too much from the initial model.
2. **Toxicity Control Task**:
   - The goal is to reduce the toxicity of the generated content.
   - Uses the GPT-Neo-2.7B model and the RealToxicityPrompts dataset.
   - The results indicate that ODPO is significantly better than DPO at reducing toxicity, especially when the dataset is small.
3. **Summarization Task**:
   - Uses the Reddit TL;DR dataset and sets the offset value according to human scores.
   - The results show that ODPO achieves a higher win rate across sampling temperatures, and is significantly better than DPO at low temperatures.

### Summary

By introducing Offset-augmented Direct Preference Optimization (ODPO), the paper addresses DPO's inability to distinguish how strongly one response is preferred over another. The experimental results show that ODPO performs well across multiple tasks, especially when the dataset is small, and aligns the language model with human preferences more effectively.
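
To make the loss in the formula analysis concrete, the following is a minimal sketch of the per-batch ODPO objective: the negative log-sigmoid of the estimated reward difference minus the offset. It is not the authors' implementation. It assumes that sequence-level log-probabilities under the fine-tuned policy and the SFT model have already been computed, and the function names (`log_sigmoid`, `estimated_reward`, `odpo_loss`), the dictionary keys, and the default `beta=0.1` are all hypothetical choices made for illustration. How the offset itself is derived from scores is left to the caller.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def estimated_reward(logp_policy: float, logp_sft: float, beta: float) -> float:
    """beta * log(pi_theta(y|x) / pi_SFT(y|x)), given sequence log-probabilities."""
    return beta * (logp_policy - logp_sft)


def odpo_loss(pairs, beta: float = 0.1) -> float:
    """Mean ODPO loss over a batch of preference pairs.

    Each pair is a dict (hypothetical layout) holding log-probabilities of the
    preferred (w) and dispreferred (l) responses under the policy and the SFT
    model, plus an optional non-negative "offset". A missing offset is treated
    as 0, which reduces the term to the plain DPO loss.
    """
    total = 0.0
    for p in pairs:
        r_w = estimated_reward(p["logp_w"], p["logp_w_sft"], beta)
        r_l = estimated_reward(p["logp_l"], p["logp_l_sft"], beta)
        total += -log_sigmoid(r_w - r_l - p.get("offset", 0.0))
    return total / len(pairs)


# Usage: the same pair scored without and with an offset (illustrative numbers).
pair = {"logp_w": -10.0, "logp_w_sft": -10.0, "logp_l": -12.0, "logp_l_sft": -12.0}
dpo_value = odpo_loss([pair])                      # offset defaults to 0
odpo_value = odpo_loss([dict(pair, offset=1.0)])   # stronger preference
print(dpo_value, odpo_value)
```

Because the offset is subtracted inside the sigmoid, a pair with a larger offset incurs a larger loss until the estimated reward gap exceeds that offset, which is how stronger preferences are pushed further apart.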