Abstract: Aligning language models (LMs) with curated human feedback is critical to control their behaviors in real-world applications. Several recent policy optimization methods, such as DPO and SLiC, serve as promising alternatives to the traditional Reinforcement Learning from Human Feedback (RLHF) approach. In practice, human feedback often comes in the form of a ranked list over multiple responses, which amortizes the cost of reading the prompt. Multiple responses can also be ranked by reward models or AI feedback. However, a thorough study of fitting directly on a list of responses is lacking. In this work, we formulate LM alignment as a \textit{listwise} ranking problem and describe the LiPO framework, where the policy can potentially learn more effectively from a ranked list of plausible responses given the prompt. This view draws an explicit connection to Learning-to-Rank (LTR), where most existing preference optimization work can be mapped to existing ranking objectives. Following this connection, we examine ranking objectives that are not well studied for LM alignment, with DPO and SLiC as special cases when the list size is two. In particular, we highlight a specific method, LiPO-$\lambda$, which leverages a state-of-the-art \textit{listwise} ranking objective and weights each preference pair in a more advanced manner. We show that LiPO-$\lambda$ can outperform DPO variants and SLiC by a clear margin on several preference alignment tasks with both curated and real rankwise preference data.

What problem does this paper attempt to address?

This paper explores how to optimize language models (LMs) on preference data through Learning-to-Rank, in order to align them better with human feedback. Current methods, such as DPO and SLiC, primarily optimize on pairwise preference data. In practice, however, human feedback often takes the form of a ranked list over multiple responses, which amortizes the cost of reading the prompt. The paper proposes a framework called LiPO (Listwise Preference Optimization), which casts LM alignment as a listwise ranking problem so that the policy can learn more effectively from a ranked list of plausible responses. Within the LiPO framework, the authors study different ranking objectives, with particular emphasis on a method called LiPO-λ, which uses a state-of-the-art listwise ranking objective and weights each preference pair in a more sophisticated way. Experimental results show that LiPO-λ outperforms DPO variants and SLiC on several preference alignment tasks, with both curated and real ranking data. The paper also points out that most existing preference optimization methods do not use all the information in the list: they fit only the pairwise or listwise order and ignore the label values. LiPO-λ addresses this by taking the label values into account and by applying Lambda weights to each pair in the list, so that the weight of a pair reflects the other items in the list. In experiments on different tasks, including the Reddit TL;DR and AnthropicHH dialogue datasets, LiPO-λ performs better than the other methods compared, which include pointwise, pairwise, and listwise loss functions. These results indicate that list-format data and label value information are important for preference optimization.
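The Lambda weighting described above can be illustrated with a minimal sketch. It assumes a LambdaLoss-style weight from the Learning-to-Rank literature (gain `2^label - 1`, DCG-style discount `log2(1 + rank)`); the summary does not give the exact formula, so the weight form, the function names, and the use of plain Python lists are all illustrative assumptions rather than the authors' implementation. The scores are assumed to be per-response values produced by the policy, for example scaled log-probability ratios against a reference model, as in DPO.

```python
import math


def pairwise_logistic_loss(score_preferred, score_rejected):
    """Logistic loss on one preference pair (the DPO-style pairwise objective)."""
    return math.log1p(math.exp(-(score_preferred - score_rejected)))


def lambda_weighted_listwise_loss(scores, labels):
    """Sum of pairwise logistic losses over a list, each scaled by a Lambda weight.

    ASSUMPTION: the weight is |gain_i - gain_j| * |1/D(rank_i) - 1/D(rank_j)|,
    with gain = 2^label - 1 and D(rank) = log2(1 + rank). This is a common
    LambdaLoss form, not a formula taken from the text above.
    """
    n = len(scores)
    # Rank positions (1 = highest score) induced by the current scores.
    order = sorted(range(n), key=lambda i: -scores[i])
    rank = [0] * n
    for position, i in enumerate(order, start=1):
        rank[i] = position

    gain = [2.0 ** label - 1.0 for label in labels]
    inv_discount = [1.0 / math.log2(1.0 + r) for r in rank]

    loss = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] > labels[j]:  # response i should be ranked above j
                weight = abs(gain[i] - gain[j]) * abs(inv_discount[i] - inv_discount[j])
                loss += weight * pairwise_logistic_loss(scores[i], scores[j])
    return loss


# Three responses to one prompt, with graded labels (higher is better).
labels = [2.0, 1.0, 0.0]
agreeing = lambda_weighted_listwise_loss([2.0, 1.0, 0.0], labels)
reversed_order = lambda_weighted_listwise_loss([0.0, 1.0, 2.0], labels)
```

Under these assumptions, a list of size two reduces to the pairwise logistic loss multiplied by a constant, which matches the statement that pairwise methods such as DPO are special cases. For longer lists, pairs with a large label gap or pairs involving top rank positions receive larger weights, which is how the label values and the rest of the list enter the objective.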

LiPO: Listwise Preference Optimization through Learning-to-Rank

Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model

Pairwise Proximal Policy Optimization: Harnessing Relative Feedback for LLM Alignment

Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts

RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models

WPO: Enhancing RLHF with Weighted Preference Optimization

LIRE: listwise reward enhancement for preference alignment

AIPO: Improving Training Objective for Iterative Preference Optimization

Accelerated Preference Optimization for Large Language Model Alignment

$α$-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs

Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness

Aligning CodeLLMs with Direct Preference Optimization

ROPO: Robust Preference Optimization for Large Language Models

Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization

Direct Alignment of Language Models via Quality-Aware Self-Refinement

Statistical Rejection Sampling Improves Preference Optimization

MallowsPO: Fine-Tune Your LLM with Preference Dispersions

DPO Meets PPO: Reinforced Token Optimization for RLHF