Abstract: Reinforcement Learning with Human Feedback (RLHF) has achieved great success in aligning large language models (LLMs) with human preferences. Prevalent RLHF approaches are reward-based, following the Bradley-Terry (BT) model assumption, which may not fully capture the complexity of human preferences. In this paper, we explore RLHF under a general preference framework and approach it from a game-theoretic perspective. Specifically, we formulate the problem as a two-player game and propose a novel online algorithm, iterative Nash policy optimization (INPO). The key idea is to let the policy play against itself via no-regret learning, thereby approximating the Nash policy. Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses, which typically incurs high computational or annotation costs. Instead, we introduce a new loss objective that is directly minimized over a preference dataset. We provide theoretical analysis for our approach and demonstrate its effectiveness through experiments on various representative benchmarks. With an LLaMA-3-8B-based SFT model, INPO achieves a 42.6% length-controlled win rate on AlpacaEval 2.0 and a 37.8% win rate on Arena-Hard, showing substantial improvement over the state-of-the-art online RLHF algorithms.
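To make the abstract's remark about the Bradley-Terry assumption concrete, the following minimal sketch shows one kind of preference that a BT model cannot express: a cycle. Under BT, the probability that response i beats response j is a sigmoid of a reward difference, so "preferred with probability above 0.5" always yields a transitive ordering. The preference matrix `P_cyclic`, the reward values, and the helper names are hypothetical and chosen only for illustration; they do not come from the paper.

```python
import numpy as np

def bt_preference(rewards):
    """Bradley-Terry preference matrix: P[i, j] = sigmoid(r_i - r_j)."""
    r = np.asarray(rewards, dtype=float)
    return 1.0 / (1.0 + np.exp(-(r[:, None] - r[None, :])))

def has_preference_cycle(P):
    """True if some i, j, k satisfy i > j, j > k and k > i, each with probability > 0.5."""
    n = P.shape[0]
    return any(
        P[i, j] > 0.5 and P[j, k] > 0.5 and P[k, i] > 0.5
        for i in range(n) for j in range(n) for k in range(n)
    )

# Hypothetical general preference over three responses:
# A usually beats B, B usually beats C, and C usually beats A.
P_cyclic = np.array([
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
])

# Any BT model ranks responses by a scalar reward, so it cannot contain a cycle.
P_bt = bt_preference([1.0, 0.3, -0.5])  # illustrative rewards

print(has_preference_cycle(P_cyclic))  # True
print(has_preference_cycle(P_bt))      # False
```

A general preference framework works directly with a matrix like `P_cyclic` instead of first compressing it into per-response rewards.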

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper addresses the problem of aligning large language models (LLMs) with human preferences. Existing reward-based reinforcement learning from human feedback (RLHF) methods usually assume that human preferences follow the Bradley-Terry (BT) model, which may not fully capture the complexity of those preferences. To overcome this limitation, the paper proposes a new online algorithm, Iterative Nash Policy Optimization (INPO), which treats alignment from a game-theoretic perspective under a general preference framework.

### Main contributions

1. **A new algorithm**: The paper proposes Iterative Nash Policy Optimization (INPO), an online algorithm that approximates the Nash policy through no-regret learning. Unlike existing methods, INPO does not need to estimate the expected win rate of each response, and so avoids high computational or annotation costs.
2. **Theoretical analysis**: The authors prove that INPO approximates the Nash policy with an iteration complexity of \( \tilde{O}\left(\frac{1}{\epsilon^2}\right) \) and that its last iterate converges to the Nash policy at a rate of \( O\left(\frac{1}{T}\right) \).
3. **Experimental verification**: Experiments on multiple benchmarks demonstrate the effectiveness of INPO. In particular, starting from an LLaMA-3-8B-based SFT model, INPO achieves a 42.6% length-controlled win rate on AlpacaEval 2.0 and a 37.8% win rate on Arena-Hard, a substantial improvement over existing online RLHF algorithms.

### Method overview

1. **Problem definition**: The paper casts the alignment of an LLM with human preferences as a two-player game in which one player (the max-player) aims to maximize its win rate against the other player (the min-player) while not deviating too far from the reference policy.
2. **Algorithm design**: INPO is based on Online Mirror Descent (OMD) and introduces a new loss objective that is minimized directly on a preference dataset. In each iteration, the algorithm generates response pairs, queries the preference oracle for preference signals, and then updates the policy by minimizing the loss objective.
3. **Theoretical guarantee**: The authors prove that OMD has sublinear regret under Assumption A (bounded log-density ratio) and that the uniform mixture of the iterates has a bounded duality gap. In addition, the last iterate converges to the Nash policy at a rate of \( O\left(\frac{1}{T}\right) \).

### Experimental results

1. **Main results**: On three widely used benchmarks (MT-Bench, AlpacaEval 2.0, Arena-Hard v0.1), INPO outperforms the baseline methods, with the largest margins on AlpacaEval 2.0 and Arena-Hard v0.1.
2. **More academic benchmarks**: On six academic benchmarks (IFEval, GPQA, MMLU, Hellaswag, TruthfulQA, GSM8K), INPO also performs well, indicating that the RLHF alignment method has a positive impact on reasoning, calibration, and generating accurate responses.
3. **Ablation study**: The ablation study verifies that adding the KL regularization term to the objective improves alignment performance.

### Related work

1. **Reward-based RLHF**: Traditional RLHF methods are usually based on reward models and use algorithms such as PPO to maximize a KL-regularized objective. Recent research has proposed methods that optimize the policy directly on preference datasets, such as the DPO algorithm.
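The self-play idea from the method overview can be illustrated on a tiny tabular game. The sketch below is not INPO itself. It uses multiplicative weights (OMD with an entropy mirror map) on a hypothetical 3x3 preference matrix, omits the KL term toward the reference policy, and reads exact win rates from the matrix, which is the very quantity INPO avoids estimating for real LLM responses. The matrix `P`, the starting policy `pi0`, the step size, and the function names are assumptions made for illustration.

```python
import numpy as np

def self_play_hedge(P, pi0, num_iters):
    """Self-play with multiplicative weights (OMD with an entropy mirror map).

    P[i, j] is the probability that response i is preferred to response j.
    Returns the uniform average of the iterates pi_1, ..., pi_T.
    """
    n = P.shape[0]
    eta = np.sqrt(8.0 * np.log(n) / num_iters)  # textbook Hedge step size (assumption)
    pi = np.asarray(pi0, dtype=float)
    avg = np.zeros(n)
    for _ in range(num_iters):
        avg += pi / num_iters
        win_rate = P @ pi                     # win rate of each response against pi
        logits = np.log(pi) + eta * win_rate  # mirror-descent step in log space
        logits -= logits.max()                # numerical stability
        pi = np.exp(logits)
        pi /= pi.sum()
    return avg

def duality_gap(P, pi):
    """Best win rate against pi minus pi's worst-case win rate; 0 at a Nash policy."""
    return (P @ pi).max() - (pi @ P).min()

# Hypothetical cyclic preference matrix; its Nash policy is uniform.
P = np.array([
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
])
pi0 = np.array([0.8, 0.1, 0.1])  # deliberately unbalanced starting policy

avg = self_play_hedge(P, pi0, num_iters=2000)
print("gap at start:  ", duality_gap(P, pi0))
print("gap of average:", duality_gap(P, avg))
```

Because the policy plays against itself in a constant-sum game, sublinear regret translates into a shrinking duality gap for the averaged policy, which is the role of the uniform mixture in the theoretical guarantee described above.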

Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning
