Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences

Corby Rosset, Ching-An Cheng, Arindam Mitra, Michael Santacroce, Ahmed Awadallah, Tengyang Xie

2024-04-05

Abstract: This paper studies post-training large language models (LLMs) using preference feedback from a powerful oracle to help a model iteratively improve over itself. The typical approach for post-training LLMs involves Reinforcement Learning from Human Feedback (RLHF), which traditionally separates reward learning and subsequent policy optimization. However, such a reward maximization approach is limited by the nature of "point-wise" rewards (such as the Bradley-Terry model), which fail to express complex intransitive or cyclic preference relations. While advances on RLHF show that reward learning and policy optimization can be merged into a single contrastive objective for stability, they still remain tethered to the reward maximization framework. Recently, a new wave of research sidesteps the reward maximization presumptions in favor of directly optimizing over "pair-wise" or general preferences. In this paper, we introduce Direct Nash Optimization (DNO), a provable and scalable algorithm that marries the simplicity and stability of contrastive learning with the theoretical generality of optimizing general preferences. Because DNO is a batched on-policy algorithm using a regression-based objective, its implementation is straightforward and efficient. Moreover, DNO enjoys monotonic improvement across iterations, which helps it improve even over a strong teacher (such as GPT-4). In our experiments, a resulting 7B parameter Orca-2.5 model aligned by DNO achieves the state-of-the-art win-rate against GPT-4-Turbo of 33% on AlpacaEval 2.0 (even after controlling for response length), an absolute gain of 26% (7% to 33%) over the initializing model. It outperforms models with far more parameters, including Mistral Large, Self-Rewarding LM (70B parameters), and older versions of GPT-4.

Machine Learning, Artificial Intelligence, Computation and Language

What problem does this paper attempt to address?

This paper addresses the limitations of post-training large language models (LLMs) with preference feedback under the reward maximization framework. The traditional RLHF approach separates reward learning from subsequent policy optimization, and its "point-wise" rewards (such as the Bradley-Terry model) cannot express complex intransitive or cyclic preference relations. The paper proposes Direct Nash Optimization (DNO), an algorithm that combines the simplicity and stability of contrastive learning with the theoretical generality of optimizing general preferences. DNO is a batched on-policy algorithm with a regression-based objective: it regresses an internal reward function, defined through the policy, onto the expected win rate. This gives monotonic improvement across iterations and helps the model improve even over a strong teacher such as GPT-4. In the experiments, a 7B-parameter Orca-2.5 model aligned by DNO achieves a 33% win rate against GPT-4-Turbo on AlpacaEval 2.0 (even after controlling for response length), an absolute gain of 26 percentage points (7% to 33%) over the initializing model. It also outperforms models with far more parameters, including Mistral Large and Self-Rewarding LM (70B parameters).
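
To make the two central ideas concrete, the following is a minimal Python sketch, not the paper's implementation. The preference matrix `P`, the helper names, and the value `beta=0.1` are hypothetical and chosen only for illustration. The first part shows a cyclic preference that no scalar Bradley-Terry reward can reproduce, and that under such a cycle every response has a 50% expected win rate against the uniform policy. The second part shows a DPO-style logistic loss on one preferred/dispreferred pair; this form is assumed here as a representative contrastive objective, and the exact objective used by DNO may differ.

```python
import math

# Hypothetical preference oracle over three candidate responses A, B, C.
# P[i][j] is the probability that response i is preferred over response j.
# The preferences are cyclic: A beats B, B beats C, and C beats A.
P = [
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
]


def expected_win_rate(i, policy):
    """Probability that response i beats a response sampled from `policy`."""
    return sum(policy[j] * P[i][j] for j in range(len(policy)))


def bradley_terry_consistent(pref):
    """Check whether `pref` can come from scalar rewards r.

    Bradley-Terry requires pref[i][j] = sigmoid(r_i - r_j), which forces
    logit(pref[i][k]) = logit(pref[i][j]) + logit(pref[j][k]) for all i, j, k.
    """
    def logit(p):
        return math.log(p / (1.0 - p))

    n = len(pref)
    return all(
        abs(logit(pref[i][k]) - (logit(pref[i][j]) + logit(pref[j][k]))) < 1e-9
        for i in range(n)
        for j in range(n)
        for k in range(n)
    )


def contrastive_pair_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """DPO-style logistic loss on one (preferred, dispreferred) pair.

    The "internal reward" of a response is beta * (log pi - log pi_ref);
    the loss is -log sigmoid(reward_win - reward_lose).
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    return math.log1p(math.exp(-margin))


uniform = [1.0 / 3.0] * 3
print(bradley_terry_consistent(P))                       # cyclic preferences
print([expected_win_rate(i, uniform) for i in range(3)])  # win rates vs. uniform
print(contrastive_pair_loss(-1.0, -3.0, -2.0, -2.0))
```

In this toy setting no single response is best, so a scalar reward has nothing meaningful to maximize, whereas the expected win rate against a policy remains well defined. That is the quantity the summary above describes DNO as regressing onto.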

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning

Self-Play Preference Optimization for Language Model Alignment

Nash Learning from Human Feedback

Simultaneous Reward Distillation and Preference Learning: Get You a Language Model Who Can Do Both

Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model

New Desiderata for Direct Preference Optimization

Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level

Accelerated Preference Optimization for Large Language Model Alignment

Optimizing LLMs with Direct Preferences: A Data Efficiency Perspective

Active Preference Learning for Large Language Models

Just Say What You Want: Only-prompting Self-rewarding Online Preference Optimization

Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization

Ordinal Preference Optimization: Aligning Human Preferences via NDCG

Hybrid Preference Optimization: Augmenting Direct Preference Optimization with Auxiliary Objectives

Large Language Models as Optimizers

Reward Model Learning vs. Direct Policy Optimization: A Comparative Analysis of Learning from Human Preferences

Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization

$α$-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs