Abstract: With the growing utilization of large language models (LLMs) across domains, alignment towards human preferences has become one of the most critical aspects of training models. At the forefront of state-of-the-art human alignment methods are preference optimization methods (*PO). However, prior research has often concentrated on identifying the best-performing method, typically involving a grid search over hyperparameters, which can be impractical for general practitioners. In this paper, we aim to identify the algorithm that, while being performant, is simultaneously more robust to varying hyperparameters, thereby increasing the likelihood of achieving better results. We focus on a realistic out-of-distribution (OOD) scenario that mirrors real-world applications of human alignment, offering practical insights into the strengths and weaknesses of these methods. Furthermore, to better understand the shortcomings of generations from the different methods, we analyze the model generations through the lens of KL divergence from the SFT model and the response length statistics. Our analysis reveals that the widely adopted DPO method consistently produces lengthy responses of inferior quality that are very close to the SFT responses. Motivated by these findings, we propose an embarrassingly simple extension to the DPO algorithm, LN-DPO, resulting in more concise responses without sacrificing quality compared to the policy obtained by vanilla DPO.

What problem does this paper attempt to address?

The paper explores the challenge of aligning large language models (LLMs) with human preferences and tries to identify an optimization method that stays stable as hyperparameters vary. Specifically:

1. **Research Background**: As large language models are applied across more fields, aligning them with human preferences has become a key issue. The current mainstream approach is preference optimization (*PO), but existing research often focuses on finding the best-performing method, usually through a grid search over hyperparameters, which is impractical for most practitioners.
2. **Research Objective**: The paper aims to identify an algorithm that maintains high performance while being more robust to different hyperparameter choices. The authors focus on a realistic out-of-distribution (OOD) scenario and analyze the strengths and weaknesses of these methods in practical applications.
3. **Proposed New Method**: Based on the finding that the widely used DPO method produces lengthy responses of inferior quality that stay very close to the SFT responses, the authors propose a simple extension, LN-DPO. This method incorporates a length regularization term, making the generated responses more concise without sacrificing quality.
4. **Experimental Results**: In experimental comparisons under various hyperparameter settings, SimPO and LN-DPO outperform DPO on multiple metrics, particularly in response length and KL divergence. Additionally, SimPO is nearly twice as fast as DPO in terms of training time.

In summary, the paper both proposes a new optimization method, LN-DPO, and systematically compares several existing preference optimization methods, providing a useful reference for practitioners.
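To make the difference between DPO and a length-aware variant concrete, here is a minimal per-example sketch in plain Python. The `dpo_loss` function follows the standard DPO objective. The `ln_dpo_loss` function is only an assumption about what LN-DPO might look like: the text above does not give its formula, so the sketch simply divides each policy-to-reference log-ratio by the response's token count. All function and parameter names are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    logp_* are summed token log-probabilities under the policy,
    ref_logp_* the same quantities under the frozen SFT/reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)


def ln_dpo_loss(logp_w: float, logp_l: float,
                ref_logp_w: float, ref_logp_l: float,
                len_w: int, len_l: int,
                beta: float = 0.1) -> float:
    """Assumed length-normalized variant (not confirmed by the source).

    Each log-ratio is divided by its response length in tokens, so a long
    response cannot grow the margin merely by having more tokens.
    """
    margin = beta * ((logp_w - ref_logp_w) / len_w
                     - (logp_l - ref_logp_l) / len_l)
    return -log_sigmoid(margin)
```

With every length set to 1 the two functions coincide, which makes the normalization the only difference between them. In practice the optimal `beta` for a normalized margin would likely differ from the one used for vanilla DPO, since the margin is on a per-token scale.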

The Hitchhiker's Guide to Human Alignment with *PO

RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models

Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization

Direct Preference Optimization with an Offset

Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks

Direct Preference Optimization Using Sparse Feature-Level Constraints

Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback

Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts

$α$-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs

New Desiderata for Direct Preference Optimization

SPO: Multi-Dimensional Preference Sequential Alignment With Implicit Reward Modeling

Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model

Generalized Preference Optimization: A Unified Approach to Offline Alignment

Towards Efficient Exact Optimization of Language Model Alignment

Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment

$f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization

Preference Ranking Optimization for Human Alignment

Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization

Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization