Abstract: The dominant framework for alignment of large language models (LLMs), whether through reinforcement learning from human feedback or direct preference optimisation, is to learn from preference data. This involves building datasets where each element is a quadruplet composed of a prompt, two independent responses (completions of the prompt) and a human preference between the two, yielding a preferred and a dis-preferred response. Such data is typically scarce and expensive to collect. On the other hand, \emph{single-trajectory} datasets, where each element is a triplet composed of a prompt, a response and a piece of human feedback, are naturally more abundant. The canonical element of such a dataset is, for instance, an LLM's response to a user's prompt followed by the user's feedback, such as a thumbs-up/down. Consequently, in this work, we propose DRO, or \emph{Direct Reward Optimisation}, as a framework and associated algorithms that do not require pairwise preferences. DRO uses a simple mean-squared objective that can be implemented in various ways. We validate our findings empirically, using T5 encoder-decoder language models, and show that DRO outperforms selected baselines such as Kahneman-Tversky Optimization (KTO). Thus, we confirm that DRO is a simple and empirically compelling method for single-trajectory policy optimisation.
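The two dataset shapes contrasted in the abstract can be sketched as plain Python records. This is only an illustration: the class names, field names, and the mapping of a thumbs-up/down to a reward of +1.0/-1.0 are assumptions made for this example, not something the abstract specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferenceExample:
    """Quadruplet: a prompt, two independent responses, and a human preference."""
    prompt: str
    response_a: str
    response_b: str
    preferred_index: int  # 0 if response_a was preferred, 1 if response_b

    @property
    def preferred(self) -> str:
        return self.response_b if self.preferred_index == 1 else self.response_a

    @property
    def dispreferred(self) -> str:
        return self.response_a if self.preferred_index == 1 else self.response_b


@dataclass(frozen=True)
class SingleTrajectoryExample:
    """Triplet: a prompt, one response, and scalar feedback on that response."""
    prompt: str
    response: str
    reward: float


def from_thumbs(prompt: str, response: str, thumbs_up: bool) -> SingleTrajectoryExample:
    # Assumed encoding (hypothetical): thumbs-up -> +1.0, thumbs-down -> -1.0.
    return SingleTrajectoryExample(prompt, response, 1.0 if thumbs_up else -1.0)
```

A single-trajectory record needs only one response and one feedback signal per prompt, which is why such data tends to be cheaper to gather than paired comparisons.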

What problem does this paper attempt to address?

This paper addresses how to align large language models (LLMs) with human preferences. The commonly used methods, reinforcement learning from human feedback (RLHF) and direct preference optimization, require preference data: datasets whose elements consist of a prompt, two independent responses, and a human preference between them. Such data is scarce and expensive to collect.

The researchers propose a new framework called Direct Reward Optimization (DRO), which does not require paired preference data. Instead, it uses a simple mean-squared objective that can be implemented in multiple ways. DRO is evaluated on T5 encoder-decoder language models, where it performs better than selected baselines such as Kahneman-Tversky Optimization (KTO).

The paper also points out that single-trajectory data (triplets of a prompt, a response, and human feedback) is more abundant and easier to collect, and thus better suited to training. DRO aims to use this type of data for offline single-trajectory policy optimization. In addition, DRO is mathematically equivalent to traditional RL methods but avoids explicitly learning and sampling reward signals, which reduces the technical challenges during training.

The main contributions of the paper are:
1. Introducing the DRO framework for single-trajectory RLHF optimization using a simple quadratic objective.
2. Proposing DRO-V, which combines offline policy learning with value function learning.
3. Comparing DRO-V with KTO and demonstrating the superiority of DRO-V on the T5 model.

In this way, the paper tackles the reliance of LLM alignment on scarce and expensive preference data, proposing a simple and effective algorithm suited to the offline single-trajectory setting.
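The summary above describes the objective only as a "simple mean-squared" or "quadratic" loss and does not give its formula. The sketch below assumes one plausible form: the squared residual between the observed reward and a learned value plus a temperature-scaled log-ratio of the policy to a reference policy. The function name `dro_loss`, its arguments, and the residual itself are assumptions made for illustration and should be checked against the paper before use.

```python
from typing import Sequence


def dro_loss(
    rewards: Sequence[float],
    values: Sequence[float],
    logp_policy: Sequence[float],
    logp_ref: Sequence[float],
    tau: float = 1.0,
) -> float:
    """Mean of 0.5 * (r - V(x) - tau * log(pi(y|x) / pi_ref(y|x)))^2 over a batch.

    ASSUMPTION: this residual form is not stated in the text above; it is a
    hypothetical reading of "mean-squared objective" with a value function.
    Each argument holds one entry per (prompt, response, reward) triplet.
    """
    n = len(rewards)
    if n == 0 or not (len(values) == len(logp_policy) == len(logp_ref) == n):
        raise ValueError("all inputs must be non-empty and of equal length")

    total = 0.0
    for r, v, lp, lr in zip(rewards, values, logp_policy, logp_ref):
        log_ratio = lp - lr  # log pi(y|x) - log pi_ref(y|x)
        residual = r - v - tau * log_ratio
        total += 0.5 * residual ** 2
    return total / n
```

Under this assumed form, each training example contributes a single regression-style term, so only one response and one reward per prompt are needed. In a DRO-V-style setup, both the policy log-probabilities and the values would be trainable outputs, with the reference log-probabilities held fixed.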

Offline Regularised Reinforcement Learning for Large Language Models Alignment
