LiPO: Listwise Preference Optimization through Learning-to-Rank

Tianqi Liu, Zhen Qin, Junru Wu, Jiaming Shen, Misha Khalman, Rishabh Joshi, Yao Zhao, Mohammad Saleh, Simon Baumgartner, Jialu Liu, Peter J. Liu, Xuanhui Wang
2024-05-23
Abstract: Aligning language models (LMs) with curated human feedback is critical to control their behaviors in real-world applications. Several recent policy optimization methods, such as DPO and SLiC, serve as promising alternatives to the traditional Reinforcement Learning from Human Feedback (RLHF) approach. In practice, human feedback often comes in the format of a ranked list over multiple responses to amortize the cost of reading the prompt. Multiple responses can also be ranked by reward models or AI feedback. A thorough study of directly fitting on a list of responses has been lacking. In this work, we formulate LM alignment as a listwise ranking problem and describe the LiPO framework, where the policy can potentially learn more effectively from a ranked list of plausible responses given the prompt. This view draws an explicit connection to Learning-to-Rank (LTR), where most existing preference optimization work can be mapped to existing ranking objectives. Following this connection, we provide an examination of ranking objectives that are not well studied for LM alignment, with DPO and SLiC as special cases when the list size is two. In particular, we highlight a specific method, LiPO-λ, which leverages a state-of-the-art listwise ranking objective and weights each preference pair in a more advanced manner. We show that LiPO-λ can outperform DPO variants and SLiC by a clear margin on several preference alignment tasks with both curated and real rankwise preference data.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper explores how to align language models (LMs) with human feedback by treating preference optimization as a Learning-to-Rank problem. Current methods, such as DPO and SLiC, primarily optimize on pairwise preference data. In practice, however, human feedback often comes as a ranked list over multiple responses, which amortizes the cost of reading the prompt. The paper proposes a framework called LiPO (Listwise Preference Optimization), which formulates LM alignment as a listwise ranking problem so that the policy can learn more effectively from a ranked list of plausible responses to the same prompt.

Under the LiPO framework, the authors examine a range of ranking objectives, with DPO and SLiC as special cases when the list size is two, and place particular emphasis on a method called LiPO-λ. The paper points out that most existing preference optimization methods either work on pairs and so overlook the ordering information in the rest of the list, or use the ranking alone and ignore the label values. LiPO-λ addresses both issues: it takes the label values into account and assigns each pair in the list a Lambda weight that reflects the influence of the other items in the list.

In experiments on several preference alignment tasks, including Reddit TL;DR and the AnthropicHH dialogue dataset, LiPO-λ outperforms DPO variants and SLiC, as well as other pointwise, pairwise, and listwise loss functions, on both curated and real rankwise preference data. These results indicate that list-format data and label value information both matter for preference optimization.
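The pair weighting described above can be made concrete with a small sketch. The summary does not spell out the formula, so the following rests on assumptions: the Lambda weight is taken in the DCG-style form common in the LambdaLoss literature, Δ(i, j) = |G_i − G_j| · |1/D(τ_i) − 1/D(τ_j)| with gain G_i = 2^ψ_i − 1, discount D(τ) = log(1 + τ), and τ_i the rank of response i under the current scores; each score s_i is assumed to be a DPO-style implicit reward such as β · log(π_θ(y_i|x) / π_ref(y_i|x)). The names `softplus` and `lambda_weighted_loss` are hypothetical and chosen only for illustration.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def lambda_weighted_loss(scores, labels):
    """Pairwise logistic loss over one list, with DCG-style Lambda weights.

    scores: one score s_i per response (assumed to be an implicit reward
            derived from the policy and a reference model).
    labels: one graded label psi_i per response; higher means preferred.
    """
    n = len(scores)

    # 1-based rank of each response under the current scores, best first.
    order = sorted(range(n), key=lambda i: -scores[i])
    rank = [0] * n
    for position, i in enumerate(order, start=1):
        rank[i] = position

    # Assumed gain and inverse rank discount (DCG-style).
    gain = [2.0 ** y - 1.0 for y in labels]
    inv_discount = [1.0 / math.log(1.0 + r) for r in rank]

    loss = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] > labels[j]:
                # The weight depends on the label values and on where both
                # responses currently sit in the whole list.
                weight = abs(gain[i] - gain[j]) * abs(inv_discount[i] - inv_discount[j])
                loss += weight * softplus(-(scores[i] - scores[j]))
    return loss


if __name__ == "__main__":
    # Three responses to one prompt, with graded labels.
    print(lambda_weighted_loss([2.0, 1.0, 0.0], [1.0, 0.5, 0.0]))
    # With a list of size two, the single pair weight is a constant, so the
    # loss is proportional to the DPO-style term log(1 + exp(-(s_w - s_l))).
    print(lambda_weighted_loss([0.5, -0.2], [1, 0]))
```

Under these assumptions, a pair contributes more when its two labels are far apart and when swapping the two responses would move them across highly ranked positions, which is one way to read "taking into account the influence of other items within the list".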