Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization

Noam Razin, Sadhika Malladi, Adithya Bhaskar, Danqi Chen, Sanjeev Arora, Boris Hanin
2024-10-14
Abstract: Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences. Although these methods are designed to teach a model to generate preferred responses more frequently relative to dispreferred responses, prior work has observed that the likelihood of preferred responses often decreases during training. The current work sheds light on the causes and implications of this counter-intuitive phenomenon, which we term likelihood displacement. We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with an opposite meaning. As a simple example, training a model to prefer $\texttt{No}$ over $\texttt{Never}$ can sharply increase the probability of $\texttt{Yes}$. Moreover, when aligning the model to refuse unsafe prompts, we show that such displacement can unintentionally lead to unalignment, by shifting probability mass from preferred refusal responses to harmful responses (e.g., reducing the refusal rate of Llama-3-8B-Instruct from 74.4% to 33.4%). We theoretically characterize that likelihood displacement is driven by preferences that induce similar embeddings, as measured by a centered hidden embedding similarity (CHES) score. Empirically, the CHES score enables identifying which training samples contribute most to likelihood displacement in a given dataset. Filtering out these samples effectively mitigated unintentional unalignment in our experiments. More broadly, our results highlight the importance of curating data with sufficiently distinct preferences, for which we believe the CHES score may prove valuable.
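The counter-intuitive point in the abstract, that training can succeed by its own objective while the preferred response becomes less likely, follows from the DPO loss depending only on the gap between two log-probability ratios. The snippet below is a minimal sketch of this, assuming the standard per-example DPO loss; the function name `dpo_loss`, the value of `beta`, and all log-probabilities are made up for illustration and are not taken from the paper.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss for a preferred (w) and dispreferred (l) response.

    Computes -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) equals log(1 + exp(-m)).
    return math.log1p(math.exp(-margin))


# Hypothetical reference-model log-probabilities of the two responses.
ref_w, ref_l = -2.0, -2.5

# At initialization the policy equals the reference, so the margin is zero.
loss_before = dpo_loss(ref_w, ref_l, ref_w, ref_l)

# After some training, BOTH responses have become less likely, but the
# dispreferred one dropped more, so the loss still went down.
policy_w, policy_l = -3.0, -6.0
loss_after = dpo_loss(policy_w, policy_l, ref_w, ref_l)

print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.4f}")
print(f"preferred log-prob moved from {ref_w} to {policy_w}")
```

Because the loss only rewards widening the gap, the probability mass removed from both responses has to go somewhere else. The abstract's concern is that it can land on responses with the opposite meaning.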
Machine Learning, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper investigates an anomalous phenomenon that occurs during Direct Preference Optimization (DPO): **likelihood displacement**. DPO and its variants are designed to make a model generate preferred responses more often than dispreferred ones, yet prior work has observed that the likelihood of the preferred responses often decreases during training. This contradicts expectations, and it can shift probability mass to responses that oppose the stated preferences, causing serious alignment failures. Specifically, the paper focuses on the following aspects:

1. **Causes of Likelihood Displacement**: Through theoretical analysis and experimental validation, the paper shows that likelihood displacement is driven by preferences that induce similar embeddings, as measured by the centered hidden embedding similarity (CHES) score.
2. **Impact of Likelihood Displacement**: The paper demonstrates that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with an opposite meaning. A simple example is that training a model to prefer "No" over "Never" can sharply increase the probability of "Yes." Likewise, when a model is trained to refuse unsafe prompts, likelihood displacement can shift probability mass from refusals to harmful responses, for example reducing the refusal rate of Llama-3-8B-Instruct from 74.4% to 33.4%.
3. **Preventive Measures**: The paper proposes using the CHES score to identify and filter out the training samples that contribute most to likelihood displacement. Experimental results show that this filtering is more effective than other methods (such as adding a supervised fine-tuning term) at preventing unintentional unalignment.

### Main Contributions

1. **Revealing the Prevalence and Severity of Likelihood Displacement**: Likelihood displacement occurs even in simple settings and can lead to serious alignment failures.
2. **Theoretical Analysis**: By analyzing the geometry of the model's hidden embeddings, the paper provides a theoretical explanation for likelihood displacement and introduces the CHES score as a measure of how similar the preferred and dispreferred responses are.
3. **Identifying the Source of Likelihood Displacement**: The CHES score identifies which training samples in a given dataset contribute most to likelihood displacement.
4. **Preventive Measures**: Experiments demonstrate that filtering out samples with high CHES scores effectively mitigates unintentional unalignment, improving the safety and reliability of the model.

In summary, through theoretical and empirical study, the paper explains the mechanism and consequences of likelihood displacement in DPO and shows that CHES-based data filtering mitigates it, underscoring the importance of curating preference data with sufficiently distinct responses.
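The filtering step described above can be sketched as follows. The text here does not state the CHES formula, so the definition used below (inner product of the summed preferred and dispreferred hidden embeddings, minus the squared norm of the summed preferred embeddings) is an assumption that should be checked against the paper. The helper names, the `keep_fraction` parameter, the sample layout, and the toy 2-D embeddings are all hypothetical. In practice the embeddings would come from the model's hidden states, one per token position of each response.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def _sum_vectors(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise sum of equally sized vectors."""
    return [sum(components) for components in zip(*vectors)]


def _dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def ches_score(pref_hidden: Sequence[Vector], disp_hidden: Sequence[Vector]) -> float:
    """Assumed CHES definition: <S_pref, S_disp> - ||S_pref||^2.

    S_pref and S_disp are the sums of the per-token hidden embeddings of the
    preferred and dispreferred responses. Higher means more similar.
    """
    s_pref = _sum_vectors(pref_hidden)
    s_disp = _sum_vectors(disp_hidden)
    return _dot(s_pref, s_disp) - _dot(s_pref, s_pref)


def filter_by_ches(samples: List[Dict], keep_fraction: float = 0.5) -> List[Dict]:
    """Keep the fraction of samples with the lowest CHES scores."""
    ranked = sorted(
        samples, key=lambda s: ches_score(s["pref_hidden"], s["disp_hidden"])
    )
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Toy samples: one token per response, 2-D embeddings (illustrative only).
similar = {"id": "similar", "pref_hidden": [[1.0, 0.0]], "disp_hidden": [[2.0, 0.0]]}
distinct = {"id": "distinct", "pref_hidden": [[1.0, 0.0]], "disp_hidden": [[0.0, 1.0]]}

kept = filter_by_ches([similar, distinct], keep_fraction=0.5)
print([s["id"] for s in kept])
```

Ranking and keeping a fixed fraction is only one possible policy; a score threshold would work equally well in this sketch, and the right cutoff would depend on the dataset.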