Ordinal Preference Optimization: Aligning Human Preferences via NDCG

Yang Zhao, Yixin Wang, Mingzhang Yin
2024-10-06
Abstract: Aligning Large Language Models (LLMs) with diverse human preferences is a pivotal technique for controlling model behaviors and enhancing generation quality. Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO), and their variants optimize language models by pairwise comparisons. However, when multiple responses are available, these approaches fall short of leveraging the extensive information in the ranking given by the reward models or human feedback. In this work, we propose a novel listwise approach named Ordinal Preference Optimization (OPO), which employs the Normalized Discounted Cumulative Gain (NDCG), a widely-used ranking metric, to better utilize relative proximity within ordinal multiple responses. We develop an end-to-end preference optimization algorithm by approximating NDCG with a differentiable surrogate loss. This approach builds a connection between ranking models in information retrieval and the alignment problem. In aligning multi-response datasets assigned with ordinal rewards, OPO outperforms existing pairwise and listwise approaches on evaluation sets and general benchmarks like AlpacaEval. Moreover, we demonstrate that increasing the pool of negative samples can enhance model performance by reducing the adverse effects of trivial negatives.
Computation and Language
### What problem does this paper attempt to solve?

This paper addresses how to better align large language models (LLMs) with diverse human preferences. Existing methods such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) mainly rely on pairwise comparisons to optimize language models. When multiple responses are available, these methods fail to fully use the ranking information provided by the reward model or human feedback.

To this end, the paper proposes a new listwise method, **Ordinal Preference Optimization (OPO)**, which uses the widely used ranking metric **Normalized Discounted Cumulative Gain (NDCG)** to better exploit the relative order among multiple responses. By approximating NDCG with a differentiable surrogate loss, OPO performs preference optimization end to end and connects ranking models in information retrieval with the LLM alignment problem.

### Main contributions

1. **A new listwise alignment method, OPO**: The method uses multiple ordered responses and outperforms existing pairwise and listwise methods on models of different scales.
2. **A connection between ranking models in information retrieval and the LLM alignment problem**: The paper demonstrates the effectiveness of directly optimizing ranking metrics for LLM alignment.
3. **A dataset of ordered multi-response data**: The paper constructs this dataset and shows that enlarging the pool of negative samples can improve the performance of existing pairwise methods.

### Specific problem description

- **Resource cost and sensitivity**: The RLHF process is resource-intensive and very sensitive to hyperparameters.
- **Limitations of pairwise comparisons**: Existing pairwise methods are effective, but they cannot fully use the information in the entire response list when dealing with multi-response data.
- **Non-differentiability**: NDCG is an important evaluation metric, but it is non-differentiable, which makes it difficult to use directly as a training objective.

By introducing OPO, the paper provides a more comprehensive solution that better aligns LLMs with human preferences and thereby improves generation quality.

### Summary of mathematical formulas

- **Reward score calculation**:
  \[ s(x, y) = \beta \log \frac{\pi_{\theta}(y|x)}{\pi_{\text{ref}}(y|x)} \]
  where $\pi_{\theta}(y|x)$ and $\pi_{\text{ref}}(y|x)$ are the probabilities that the policy model and the reference model, respectively, generate response $y$ given prompt $x$.
- **NDCG calculation**:
  \[ \text{DCG}@k = \sum_{j = 1}^{k} G(\psi_{j}) D(\tau(j)) \]
  \[ \text{NDCG}@k = \frac{\text{DCG}@k}{\text{maxDCG}@k} \]
  where $\psi_{j}$ is the ordinal reward label of the $j$-th response, $G(\psi_{j}) = 2^{\psi_{j}} - 1$ is the gain function, $D(\tau(j)) = \frac{1}{\log_{2}(\tau(j) + 1)}$ is the discount function, and $\tau(j)$ is the rank position of the $j$-th response after sorting by the reward scores computed by the current model.
- **NeuralNDCG approximation**:
  \[ \text{NeuralNDCG}@k(\tau; s, \Psi) = \frac{1}{\text{maxDCG}@k} \sum_{j = 1}^{k} (\text{scale}(\widehat{P}) \cdot G(\Psi))_{j} \cdot D(j) \]
  where $\text{scale}(\widehat{P})$ denotes Sinkhorn scaling applied to the approximate permutation matrix $\widehat{P}$, which ensures that each column sums to 1.

Through these improvements, the OPO method can more effectively use the relative order information in multi-response data.
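
The reward-score and NDCG formulas above can be illustrated with a small, dependency-free Python sketch. It is not the paper's implementation: the function names, the value `beta=0.1`, and the example log-probabilities and labels are all hypothetical, and truncation at `k` is interpreted in the standard way (only responses ranked in the top `k` contribute), which coincides with the formula when `k` equals the list length.

```python
import math


def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """s(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x)), from log-probabilities."""
    return beta * (logp_policy - logp_ref)


def ranks_from_scores(scores):
    """tau(j): 1-based rank of response j when scores are sorted in descending order."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    tau = [0] * len(scores)
    for position, j in enumerate(order, start=1):
        tau[j] = position
    return tau


def gain(label):
    """G(psi) = 2^psi - 1."""
    return 2.0 ** label - 1.0


def discount(rank):
    """D(r) = 1 / log2(r + 1)."""
    return 1.0 / math.log2(rank + 1)


def ndcg_at_k(labels, scores, k):
    """NDCG@k of the ranking induced by `scores` against ordinal `labels`."""
    tau = ranks_from_scores(scores)
    dcg = sum(gain(labels[j]) * discount(tau[j])
              for j in range(len(labels)) if tau[j] <= k)
    ideal = sorted(labels, reverse=True)[:k]
    max_dcg = sum(gain(psi) * discount(r) for r, psi in enumerate(ideal, start=1))
    return dcg / max_dcg if max_dcg > 0 else 0.0


# Hypothetical example: four responses to one prompt.
labels = [3, 1, 2, 0]                      # ordinal rewards (assumed values)
logp_policy = [-12.0, -10.0, -11.0, -15.0]  # assumed sequence log-probs
logp_ref = [-13.0, -12.5, -11.5, -14.0]
scores = [implicit_reward(p, r) for p, r in zip(logp_policy, logp_ref)]
print(ranks_from_scores(scores))
print(ndcg_at_k(labels, scores, k=4))
```

Because `ranks_from_scores` relies on a hard sort, the resulting NDCG value is piecewise constant in the scores, which is the non-differentiability that motivates a smooth surrogate.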
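
The NeuralNDCG formula above can be sketched in plain Python as a forward computation. This is a minimal illustration under stated assumptions, not the authors' code: the relaxed permutation matrix is built here with the NeuralSort relaxation, `temperature` is a hypothetical smoothing parameter, Sinkhorn scaling is run for a fixed number of iterations, and all function names are invented for this example. A real training setup would use an automatic-differentiation framework so that gradients flow through the scores.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def neural_sort(scores, temperature=1.0):
    """NeuralSort relaxation: row i is a soft one-hot over the i-th largest score."""
    n = len(scores)
    # Sum over l of |s_j - s_l| for each item j.
    abs_diff_sums = [sum(abs(sj - sl) for sl in scores) for sj in scores]
    rows = []
    for i in range(1, n + 1):
        logits = [((n + 1 - 2 * i) * scores[j] - abs_diff_sums[j]) / temperature
                  for j in range(n)]
        rows.append(softmax(logits))
    return rows


def sinkhorn_scale(matrix, iterations=50):
    """Alternately normalise columns and rows so both sum to (approximately) 1."""
    n = len(matrix)
    m = [row[:] for row in matrix]
    for _ in range(iterations):
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
        m = [[v / sum(row) for v in row] for row in m]
    return m


def neural_ndcg_at_k(labels, scores, k, temperature=1.0):
    """Smooth NDCG@k: soft-sorted gains, discounted by position, over maxDCG@k."""
    n = len(labels)
    gains = [2.0 ** psi - 1.0 for psi in labels]
    p_hat = sinkhorn_scale(neural_sort(scores, temperature))
    # (scale(P_hat) . G(Psi))_j: expected gain placed at position j.
    soft_gains = [sum(p_hat[j][l] * gains[l] for l in range(n)) for j in range(n)]
    dcg = sum(soft_gains[j] / math.log2(j + 2) for j in range(k))
    ideal = sorted(gains, reverse=True)
    max_dcg = sum(ideal[j] / math.log2(j + 2) for j in range(k))
    return dcg / max_dcg if max_dcg > 0 else 0.0


# Hypothetical labels and scores; a small temperature approaches the hard-sorted NDCG.
labels = [3, 1, 2, 0]
scores = [0.1, 0.25, 0.05, -0.1]
print(neural_ndcg_at_k(labels, scores, k=4, temperature=1.0))
print(neural_ndcg_at_k(labels, scores, k=4, temperature=0.001))
```

Lower temperatures make the relaxed matrix closer to a true permutation, so the value tracks exact NDCG more tightly at the cost of sharper gradients. One plausible training loss, assumed here rather than taken from the summary, is the negative of this quantity.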