Abstract: Aligning large language models (LLMs) with human values and intentions is crucial for their utility, honesty, and safety. Reinforcement learning from human feedback (RLHF) is a popular approach to achieve this alignment, but it faces challenges in computational efficiency and training stability. Recent methods like Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO) have proposed offline alternatives to RLHF, simplifying the process by reparameterizing the reward function. However, DPO depends on a potentially suboptimal reference model, and SimPO's assumption of a fixed target reward margin may lead to suboptimal decisions in diverse data settings. In this work, we propose $\alpha$-DPO, an adaptive preference optimization algorithm designed to address these limitations by introducing a dynamic reward margin. Specifically, $\alpha$-DPO employs an adaptive preference distribution, balancing the policy model and the reference model to achieve personalized reward margins. We provide theoretical guarantees for $\alpha$-DPO, demonstrating its effectiveness as a surrogate optimization objective and its ability to balance alignment and diversity through KL divergence control. Empirical evaluations on AlpacaEval 2 and Arena-Hard show that $\alpha$-DPO consistently outperforms DPO and SimPO across various model settings, establishing it as a robust approach for fine-tuning LLMs. Our method achieves significant improvements in win rates, highlighting its potential as a powerful tool for LLM alignment. The code is available at https://github.com/junkangwu/alpha-DPO.
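
To make the contrast between the two baselines concrete, the following is a minimal sketch of the per-pair DPO and SimPO losses as they are commonly written, using only the Python standard library. The function names, argument names, and default values of `beta` and `gamma` are illustrative assumptions and are not taken from the abstract. The adaptive margin of $\alpha$-DPO is not specified in the abstract, so it is deliberately not implemented here; the sketch only shows where DPO relies on the reference model and where SimPO uses a fixed margin.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(pol_w: float, pol_l: float, ref_w: float, ref_l: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss.

    pol_* / ref_* are summed sequence log-probabilities of the chosen (w)
    and rejected (l) responses under the policy and the reference model.
    The default beta is an illustrative assumption.
    """
    # Implicit reward gap: difference of policy-vs-reference log-ratios.
    log_ratio_gap = (pol_w - ref_w) - (pol_l - ref_l)
    return -log_sigmoid(beta * log_ratio_gap)


def simpo_loss(pol_w: float, pol_l: float, len_w: int, len_l: int,
               beta: float = 2.0, gamma: float = 0.5) -> float:
    """Per-pair SimPO loss.

    Uses length-normalized policy log-probabilities, no reference model,
    and a fixed target reward margin gamma. The default beta and gamma
    are illustrative assumptions.
    """
    reward_gap = beta * (pol_w / len_w - pol_l / len_l)
    return -log_sigmoid(reward_gap - gamma)


if __name__ == "__main__":
    # Hypothetical log-probabilities for one preference pair.
    print(dpo_loss(pol_w=-10.0, pol_l=-14.0, ref_w=-11.0, ref_l=-13.0))
    print(simpo_loss(pol_w=-10.0, pol_l=-14.0, len_w=8, len_l=9))
```

In this sketch, `gamma` is the same constant for every pair. A dynamic reward margin, as proposed in the abstract, would replace that constant with a value that depends on the individual pair.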

HyperDPO: Conditioned One-Shot Multi-Objective Fine-Tuning Framework

Orthogonal Finetuning for Direct Preference Optimization

Online DPO: Online Direct Preference Optimization with Fast-Slow Chasing

mDPO: Conditional Preference Optimization for Multimodal Large Language Models

Token-level Direct Preference Optimization

Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization

MPPO: Multi Pair-wise Preference Optimization for LLMs with Arbitrary Negative Samples

Direct Preference Optimization with an Offset

Minor DPO reject penalty to increase training robustness

ASFT: Aligned Supervised Fine-Tuning through Absolute Likelihood

Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model

Towards Analyzing and Understanding the Limitations of DPO: A Theoretical Perspective

sDPO: Don't Use Your Data All at Once

Direct Multi-Turn Preference Optimization for Language Agents

DPO Meets PPO: Reinforced Token Optimization for RLHF

Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive

Hybrid Preference Optimization: Augmenting Direct Preference Optimization with Auxiliary Objectives

HyperTuner: A Cross-Layer Multi-Objective Hyperparameter Auto-Tuning Framework for Data Analytic Services

$\alpha$-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs