Entropy Controllable Direct Preference Optimization

Motoki Omura, Yasuhiro Fujita, Toshiki Kataoka

2024-11-12

Abstract: In the post-training of large language models (LLMs), Reinforcement Learning from Human Feedback (RLHF) is an effective approach to achieve generation aligned with human preferences. Direct Preference Optimization (DPO) allows for policy training with a simple binary cross-entropy loss without a reward model. The objective of DPO is regularized by reverse KL divergence that encourages mode-seeking fitting to the reference policy. Nonetheless, we show that minimizing reverse KL divergence could fail to capture a mode of the reference distribution, which may hurt the policy's performance. Based on this observation, we propose a simple modification to DPO, H-DPO, which allows for control over the entropy of the resulting policy, enhancing the distribution's sharpness and thereby enabling mode-seeking fitting more effectively. In our experiments, we show that H-DPO outperformed DPO across various tasks, demonstrating superior results in pass@$k$ evaluations for mathematical tasks. Moreover, H-DPO is simple to implement, requiring only minor modifications to the loss calculation of DPO, which makes it highly practical and promising for wide-ranging applications in the training of LLMs.
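The "minor modification to the loss calculation" can be sketched for a single preference pair. The abstract does not state the formula, so this sketch rests on an assumption: the reverse KL regularizer is split into an entropy term and a cross-entropy term, and only the entropy term is weighted by α. Under that assumption the implicit reward becomes αβ log π(y|x) − β log π_ref(y|x), and setting α = 1 gives back the standard DPO loss. The function name `h_dpo_loss` and its argument names are hypothetical, chosen for illustration only.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # log sigmoid(z) = -log(1 + exp(-z)); branch so exp() never overflows.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def h_dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
    alpha: float = 1.0,
) -> float:
    """Loss for one (chosen, rejected) pair of sequence log-probabilities.

    ASSUMPTION (not stated in the abstract): the implicit reward is
    alpha * beta * log pi(y|x) - beta * log pi_ref(y|x).
    With alpha == 1.0 this is the standard DPO loss.
    """
    # alpha scales only the policy log-probabilities, not the reference ones.
    reward_chosen = beta * (alpha * policy_logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (alpha * policy_logp_rejected - ref_logp_rejected)
    # Binary cross-entropy on the reward margin, as in DPO.
    return -log_sigmoid(reward_chosen - reward_rejected)
```

For example, `h_dpo_loss(-10.0, -12.0, -11.0, -11.0, alpha=1.0)` evaluates the ordinary DPO loss for that pair, while passing a different `alpha` changes only how strongly the policy's own log-probabilities enter the margin.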

Machine Learning, Artificial Intelligence, Computation and Language

What problem does this paper attempt to address?

The paper addresses how to align large language models (LLMs) with human preferences more effectively during post-training using Direct Preference Optimization (DPO). Specifically, it points out that although the DPO objective is regularized by reverse KL divergence, which encourages mode-seeking fitting to the reference policy, minimizing reverse KL divergence can fail to capture a mode of the reference distribution. To address this, the paper proposes a modified method, H-DPO, which introduces a hyperparameter α to control the entropy of the resulting policy. Controlling the entropy can sharpen the distribution and make mode-seeking fitting more effective. The main contributions of the paper are as follows:

1. **Proposing H-DPO**: A modification of direct preference optimization in which the entropy of the resulting policy is controlled by adjusting α, enabling more effective mode-seeking fitting.
2. **Experimental verification**: Experiments on multiple tasks show that H-DPO outperforms standard DPO in performance and diversity, with particularly strong results in pass@k evaluations on mathematical tasks.
3. **Simple to implement**: H-DPO requires only minor modifications to the DPO loss calculation, which makes it practical and broadly applicable.

In conclusion, the paper aims to improve the alignment of large language models during post-training through H-DPO, making their generated outputs more accurate and diverse.
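The effect of α on entropy can be illustrated on a toy discrete distribution. This sketch assumes the entropy-weighted objective has an optimal policy proportional to π_ref^(1/α) · exp(r/(αβ)); the page itself does not state this closed form. With a constant reward, that reduces to raising the reference probabilities to the power 1/α and renormalizing. The helper names `sharpen` and `entropy` are hypothetical.

```python
import math


def sharpen(probs: list[float], alpha: float) -> list[float]:
    """Return the distribution proportional to p ** (1 / alpha).

    ASSUMPTION: this is the constant-reward case of the assumed
    optimum pi_ref^(1/alpha) * exp(r / (alpha * beta)).
    """
    weights = [p ** (1.0 / alpha) for p in probs]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs: list[float]) -> float:
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


reference = [0.5, 0.3, 0.2]
for alpha in (0.5, 1.0, 2.0):
    policy = sharpen(reference, alpha)
    print(f"alpha={alpha}: entropy={entropy(policy):.4f}")
```

In this toy setting, α < 1 concentrates mass on the most likely outcome (lower entropy), α = 1 leaves the reference distribution unchanged, and α > 1 flattens it (higher entropy).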

Uncertainty-Penalized Direct Preference Optimization

Minor DPO reject penalty to increase training robustness

SEE-DPO: Self Entropy Enhanced Direct Preference Optimization

Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

New Desiderata for Direct Preference Optimization

Provably Robust DPO: Aligning Language Models with Noisy Feedback

Direct Multi-Turn Preference Optimization for Language Agents

$α$-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs

Direct Preference Optimization with an Offset

Towards Analyzing and Understanding the Limitations of DPO: A Theoretical Perspective

Enhancing LLM Safety via Constrained Direct Preference Optimization

Direct Preference Optimization With Unobserved Preference Heterogeneity

On the Generalization of Preference Learning with DPO

Token-level Direct Preference Optimization

MallowsPO: Fine-Tune Your LLM with Preference Dispersions

Hybrid Preference Optimization: Augmenting Direct Preference Optimization with Auxiliary Objectives