Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint

Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, Tong Zhang

2024-05-01

Abstract: This paper studies the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF). We first identify the primary challenges of existing popular methods like offline PPO and offline DPO as lacking in strategical exploration of the environment. Then, to understand the mathematical principle of RLHF, we consider a standard mathematical formulation, the reverse-KL regularized contextual bandit for RLHF. Despite its widespread practical application, a rigorous theoretical analysis of this formulation remains open. We investigate its behavior in three distinct settings -- offline, online, and hybrid -- and propose efficient algorithms with finite-sample theoretical guarantees. Moving towards practical applications, our framework, with a robust approximation of the information-theoretical policy improvement oracle, naturally gives rise to several novel RLHF algorithms. This includes an iterative version of the Direct Preference Optimization (DPO) algorithm for online settings, and a multi-step rejection sampling strategy for offline scenarios. Our empirical evaluations on real-world alignment experiment of large language model demonstrate that these proposed methods significantly surpass existing strong baselines, such as DPO and Rejection Sampling Optimization (RSO), showcasing the connections between solid theoretical foundations and their potent practical implementations.

Machine Learning, Artificial Intelligence

What problem does this paper attempt to address?

This paper studies how to align generative models using Reinforcement Learning from Human Feedback (RLHF). It argues that popular existing methods, such as offline PPO and offline DPO, lack strategic exploration of the environment. To capture the mathematical principle behind RLHF, the authors adopt a standard formulation, the reverse-KL regularized contextual bandit. Although this formulation is widely used in practice, it has lacked a rigorous theoretical analysis. The paper proposes algorithms for three settings (offline, online, and hybrid) and provides finite-sample theoretical guarantees for them. For offline learning, it uses pessimistic estimation to improve efficiency. For online iterative learning, it proposes a strategy based on batch exploration. In alignment experiments on large language models, the resulting methods significantly outperform strong baselines such as DPO and RSO, illustrating the connection between the theoretical foundation and practical implementation. In summary, the paper addresses how to align generative models more effectively with RLHF, particularly with respect to exploration and imperfect reward functions. It also shows how theoretical analysis can guide practical algorithm design so as to reduce overfitting and reward hacking and improve model performance.
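
To make the reverse-KL regularized objective more concrete, here is a minimal Python sketch for a single prompt with a small, finite set of candidate responses. It relies on the well-known fact that the maximizer of E_pi[r] - eta * KL(pi || pi_0) is the Gibbs policy pi*(a) proportional to pi_0(a) * exp(r(a) / eta). The names `gibbs_policy`, `kl_regularized_value`, `ref_probs`, and `eta`, as well as the toy numbers, are assumptions chosen for illustration. This is not the paper's algorithm, which must also learn the reward from preference data and decide how to explore.

```python
import math

def gibbs_policy(ref_probs, rewards, eta):
    """Closed-form maximizer of E_pi[r] - eta * KL(pi || ref) over a finite response set.

    ref_probs: probabilities of the reference policy pi_0 (hypothetical toy values).
    rewards:   reward of each candidate response (assumed known here).
    eta:       KL regularization strength (assumed symbol name).
    """
    weights = [q * math.exp(r / eta) for q, r in zip(ref_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

def kl_regularized_value(policy, ref_probs, rewards, eta):
    """Evaluate E_pi[r] - eta * KL(pi || ref) for a given policy."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, ref_probs) if p > 0)
    return expected_reward - eta * kl

# Toy example: three candidate responses for one prompt.
ref = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
eta = 1.0

pi_star = gibbs_policy(ref, rewards, eta)
value_star = kl_regularized_value(pi_star, ref, rewards, eta)
value_ref = kl_regularized_value(ref, ref, rewards, eta)
value_greedy = kl_regularized_value([0.0, 0.0, 1.0], ref, rewards, eta)
```

In this toy setting the Gibbs policy shifts probability mass toward higher-reward responses while staying close to the reference policy. Both the unchanged reference policy and the purely greedy policy achieve a lower regularized value, which is the trade-off that the KL constraint encodes.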

Beyond Reward: Offline Preference-guided Policy Optimization

Online Iterative Reinforcement Learning from Human Feedback with General Preference Model

Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF

RLHF Workflow: From Reward Modeling to Online RLHF

Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning

Reinforcement Learning from Human Feedback with Active Queries

RRHF: Rank Responses to Align Language Models with Human Feedback

Dual Active Learning for Reinforcement Learning from Human Feedback

SAIL: Self-Improving Efficient Online Alignment of Large Language Models

RLHF from Heterogeneous Feedback via Personalization and Preference Aggregation

On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization

Pairwise Proximal Policy Optimization: Harnessing Relative Feedback for LLM Alignment

Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

Active Preference Optimization for Sample Efficient RLHF

A Unified Linear Programming Framework for Offline Reward Learning from Human Demonstrations and Feedback

Accelerated Preference Optimization for Large Language Model Alignment

Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer

DPO Meets PPO: Reinforced Token Optimization for RLHF

Learning Reward and Policy Jointly from Demonstration and Preference Improves Alignment