Abstract: Aligning Large Language Models (LLMs) with diverse human preferences is a pivotal technique for controlling model behaviors and enhancing generation quality. Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO), and their variants optimize language models by pairwise comparisons. However, when multiple responses are available, these approaches fall short of leveraging the extensive information in the ranking given by the reward models or human feedback. In this work, we propose a novel listwise approach named Ordinal Preference Optimization (OPO), which employs the Normalized Discounted Cumulative Gain (NDCG), a widely-used ranking metric, to better utilize relative proximity within ordinal multiple responses. We develop an end-to-end preference optimization algorithm by approximating NDCG with a differentiable surrogate loss. This approach builds a connection between ranking models in information retrieval and the alignment problem. In aligning multi-response datasets assigned with ordinal rewards, OPO outperforms existing pairwise and listwise approaches on evaluation sets and general benchmarks like AlpacaEval. Moreover, we demonstrate that increasing the pool of negative samples can enhance model performance by reducing the adverse effects of trivial negatives.

What problem does this paper attempt to address?

This paper addresses the problem of how to better align large language models (LLMs) with diverse human preferences. Existing methods such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) mainly rely on pairwise comparisons to optimize language models. When multiple responses are available, however, these methods fail to fully use the ranking information provided by the reward model or by human feedback.

To this end, the paper proposes a new listwise method, **Ordinal Preference Optimization (OPO)**, which uses the widely used ranking metric **Normalized Discounted Cumulative Gain (NDCG)** to better exploit the relative order among multiple responses. By approximating NDCG with a differentiable surrogate loss, OPO performs preference optimization end to end and connects ranking models in information retrieval with the LLM alignment problem.

### Main contributions

1. **A new listwise alignment method, OPO**: The method uses multiple ordered responses and outperforms existing pairwise and listwise methods on models of different scales.
2. **A connection between ranking models in information retrieval and the LLM alignment problem**: The paper demonstrates that directly optimizing ranking metrics is effective for LLM alignment.
3. **A dataset of ordered multi-response examples**: The paper also shows that enlarging the pool of negative samples can improve the performance of existing pairwise methods.

### Specific problem description

- **Resource intensiveness and sensitivity**: The RLHF process is resource-intensive and very sensitive to hyperparameters.
- **Limitations of pairwise comparisons**: Existing pairwise methods are effective, but they cannot use the information in the entire response list when handling multi-response data.
- **Nondifferentiability challenge**: NDCG is an important evaluation metric, but because it is nondifferentiable it cannot be used directly as a training objective.

By introducing OPO, the paper offers a more comprehensive and effective way to align LLMs with human preferences, thereby improving generation quality.

### Summary of mathematical formulas

- **Reward score calculation**:

  \[ s(x, y) = \beta \log \frac{\pi_{\theta}(y|x)}{\pi_{\text{ref}}(y|x)} \]

  where $\pi_{\theta}(y|x)$ and $\pi_{\text{ref}}(y|x)$ are the probabilities that the policy model and the reference model, respectively, generate response $y$ given prompt $x$.

- **NDCG calculation**:

  \[ \text{DCG}@k = \sum_{j=1}^{k} G(\psi_{j}) D(\tau(j)) \]

  \[ \text{NDCG}@k = \frac{\text{DCG}@k}{\text{maxDCG}@k} \]

  where $\psi_{j}$ is the ordinal reward label of response $j$, $G(\psi_{j}) = 2^{\psi_{j}} - 1$ is the gain function, $D(\tau(j)) = \frac{1}{\log_{2}(\tau(j) + 1)}$ is the discount function, $\tau(j)$ is the position of response $j$ after the responses are re-ranked by the reward scores computed by the current model, and $\text{maxDCG}@k$ is the DCG@k of the ideal ranking.

- **NeuralNDCG approximation**:

  \[ \text{NeuralNDCG}@k(\tau; s, \Psi) = \frac{1}{\text{maxDCG}@k} \sum_{j=1}^{k} \left(\text{scale}(\widehat{P}) \cdot G(\Psi)\right)_{j} \cdot D(j) \]

  where $\text{scale}(\widehat{P})$ denotes Sinkhorn scaling applied to $\widehat{P}$, which ensures that each column sums to 1.

Through these improvements, OPO can use the relative order information in multi-response data more effectively.
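The formulas in this summary can be made concrete with a small sketch. The code below is illustrative only and is not the authors' implementation: all function names, the `beta` and `temperature` defaults, and the sample values are assumptions. In particular, the summary does not say how the relaxed permutation matrix inside $\text{scale}(\cdot)$ is built; the sketch assumes the NeuralSort relaxation, and it truncates DCG@k by rank position (responses with $\tau(j) \le k$), which coincides with the sum as written when $k$ equals the list length.

```python
import math

def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """s(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x)), from sequence log-probs."""
    return beta * (logp_policy - logp_ref)

def gain(label):
    return 2.0 ** label - 1.0            # G(psi) = 2^psi - 1

def discount(rank):
    return 1.0 / math.log2(rank + 1)     # D(tau) = 1 / log2(tau + 1)

def ranks_from_scores(scores):
    """tau(j): 1-based position of response j when sorted by descending score."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    tau = [0] * len(scores)
    for position, j in enumerate(order, start=1):
        tau[j] = position
    return tau

def max_dcg_at_k(labels, k):
    """DCG@k of the ideal ranking (assumes at least one label is positive)."""
    ideal = sorted(labels, reverse=True)[:k]
    return sum(gain(l) * discount(r) for r, l in enumerate(ideal, start=1))

def ndcg_at_k(labels, scores, k):
    tau = ranks_from_scores(scores)
    dcg = sum(gain(l) * discount(r) for l, r in zip(labels, tau) if r <= k)
    return dcg / max_dcg_at_k(labels, k)

def neural_sort(scores, temperature=1.0):
    """Assumed relaxation: row i is a softmax that peaks at the i-th largest score."""
    n = len(scores)
    abs_diff_sums = [sum(abs(si - sj) for sj in scores) for si in scores]
    rows = []
    for i in range(1, n + 1):
        logits = [((n + 1 - 2 * i) * scores[j] - abs_diff_sums[j]) / temperature
                  for j in range(n)]
        top = max(logits)                 # subtract the max for numerical stability
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

def sinkhorn_scale(matrix, iterations=50, eps=1e-10):
    """Alternate column and row normalization so both sum to roughly 1."""
    n = len(matrix)
    m = [row[:] for row in matrix]
    for _ in range(iterations):
        col_sums = [max(sum(m[i][j] for i in range(n)), eps) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
        m = [[v / max(sum(row), eps) for v in row] for row in m]
    return m

def neural_ndcg_at_k(labels, scores, k, temperature=1.0):
    p = sinkhorn_scale(neural_sort(scores, temperature))
    gains = [gain(l) for l in labels]
    # (scale(P) . G(Psi))_j: soft estimate of the gain sitting at rank j
    soft_gains = [sum(pij * g for pij, g in zip(row, gains)) for row in p]
    dcg = sum(soft_gains[j] * discount(j + 1) for j in range(k))
    return dcg / max_dcg_at_k(labels, k)

# Hypothetical example: three responses with ordinal labels and log-probabilities.
labels = [3, 1, 2]
scores = [implicit_reward(p, r) for p, r in zip([-10.0, -14.0, -9.0], [-12.0] * 3)]
print("NDCG@3:", ndcg_at_k(labels, scores, 3))
print("NeuralNDCG@3:", neural_ndcg_at_k(labels, scores, 3, temperature=0.1))
```

As the temperature shrinks, the relaxed matrix approaches a hard permutation and the surrogate approaches the exact NDCG; larger temperatures give smoother values. In a training setting the same computation would be written in an autodiff framework so that gradients flow through the scores, which this pure-Python sketch does not attempt.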

Ordinal Preference Optimization: Aligning Human Preferences via NDCG

Optimizing Preference Alignment with Differentiable NDCG Ranking

Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts

Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model

Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization

MPPO: Multi Pair-wise Preference Optimization for LLMs with Arbitrary Negative Samples

Cal-DPO: Calibrated Direct Preference Optimization for Language Model Alignment

$α$-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs

Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization

New Desiderata for Direct Preference Optimization

Direct Preference Optimization with an Offset

Accelerated Preference Optimization for Large Language Model Alignment

SPO: Multi-Dimensional Preference Sequential Alignment With Implicit Reward Modeling

Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness

MallowsPO: Fine-Tune Your LLM with Preference Dispersions

Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization

ULMA: Unified Language Model Alignment with Human Demonstration and Point-wise Preference

Adversarial Preference Optimization: Enhancing Your Alignment via RM-LLM Game

Token-level Direct Preference Optimization

Aligning CodeLLMs with Direct Preference Optimization

Uncertainty-Penalized Direct Preference Optimization