Entropy Controllable Direct Preference Optimization

Motoki Omura, Yasuhiro Fujita, Toshiki Kataoka
2024-11-12
Abstract: In the post-training of large language models (LLMs), Reinforcement Learning from Human Feedback (RLHF) is an effective approach to achieve generation aligned with human preferences. Direct Preference Optimization (DPO) allows for policy training with a simple binary cross-entropy loss without a reward model. The objective of DPO is regularized by reverse KL divergence that encourages mode-seeking fitting to the reference policy. Nonetheless, we indicate that minimizing reverse KL divergence could fail to capture a mode of the reference distribution, which may hurt the policy's performance. Based on this observation, we propose a simple modification to DPO, H-DPO, which allows for control over the entropy of the resulting policy, enhancing the distribution's sharpness and thereby enabling mode-seeking fitting more effectively. In our experiments, we show that H-DPO outperformed DPO across various tasks, demonstrating superior results in pass@$k$ evaluations for mathematical tasks. Moreover, H-DPO is simple to implement, requiring only minor modifications to the loss calculation of DPO, which makes it highly practical and promising for wide-ranging applications in the training of LLMs.
Machine Learning, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper asks how Direct Preference Optimization (DPO) can align large language models (LLMs) with human preferences more effectively during post-training. Standard DPO regularizes its objective with the reverse KL divergence, which is meant to encourage mode-seeking fitting to the reference policy. The paper points out that minimizing the reverse KL divergence can nonetheless fail to capture a mode of the reference distribution, which may hurt the policy's performance. To address this, it proposes a modified method, H-DPO, which introduces a hyperparameter α that controls the entropy of the resulting policy. Adjusting α can sharpen the distribution, making mode-seeking fitting more effective; because α sets the entropy, it also governs how diverse the generated outputs are.

The main contributions of the paper are as follows:

1. **Proposing H-DPO**: A new direct preference optimization method in which the entropy of the resulting policy is controlled by adjusting α, enabling more effective mode-seeking fitting.
2. **Experimental verification**: In experiments on multiple tasks, H-DPO outperformed standard DPO, with notably better results in pass@$k$ evaluations on mathematical tasks.
3. **Simple to implement**: H-DPO requires only minor modifications to the loss calculation of DPO, which makes it practical and broadly applicable to LLM training.

In conclusion, the paper aims to improve the alignment of large language models during post-training by giving DPO explicit control over the entropy of the learned policy.
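
To make the "minor modification to the loss" concrete, the following is a minimal sketch and not the authors' implementation. It rests on one assumption that is not stated in the text above: the reverse KL regularizer, which decomposes as negative entropy plus cross-entropy, is replaced by α times the negative entropy plus the same cross-entropy. Under that assumption the implicit reward becomes αβ log π(y|x) − β log π_ref(y|x) (up to a prompt-dependent constant), so α simply scales the policy log-probabilities inside the usual DPO loss. The function and argument names are hypothetical, and the inputs are assumed to be summed sequence log-probabilities for a single preference pair.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def h_dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
    alpha: float = 1.0,
) -> float:
    """Per-pair loss for an entropy-weighted DPO variant (illustrative sketch).

    Assumption: alpha multiplies only the policy log-probabilities.
    With alpha = 1.0 this is the standard DPO loss.
    """
    # Implicit reward of each response, divided by beta.
    chosen = alpha * policy_logp_chosen - ref_logp_chosen
    rejected = alpha * policy_logp_rejected - ref_logp_rejected
    # Binary cross-entropy on the reward margin, as in DPO.
    return -log_sigmoid(beta * (chosen - rejected))


if __name__ == "__main__":
    # Hypothetical log-probabilities for one (chosen, rejected) pair.
    for a in (1.0, 0.9):
        print(a, h_dpo_loss(-1.0, -2.0, -1.5, -1.5, beta=0.5, alpha=a))
```

In this sketch, setting `alpha` to 1.0 reproduces standard DPO, which makes it easy to check the change against an existing DPO implementation before trying other values.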