Abstract: Reinforcement Learning (RL) from human preference-based feedback is a popular paradigm for fine-tuning generative models, and it has produced impressive models such as GPT-4 and Claude 3 Opus. This framework often consists of two steps: learning a reward model from an offline preference dataset, followed by running online RL to optimize the learned reward model. In this work, leveraging the idea of reset, we propose a new RLHF algorithm with provable guarantees. Motivated by the fact that an offline preference dataset provides informative states (i.e., data that is preferred by the labelers), our new algorithm, Dataset Reset Policy Optimization (DR-PO), integrates the existing offline preference dataset into the online policy training procedure via dataset reset: it directly resets the policy optimizer to the states in the offline dataset, instead of always starting from the initial state distribution. In theory, we show that DR-PO learns to perform at least as well as any policy that is covered by the offline dataset under general function approximation with finite sample complexity. In experiments, we demonstrate that on both the TL;DR summarization and the Anthropic Helpful and Harmless (HH) datasets, the generation from DR-PO is better than that from Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), under the metric of GPT-4 win rate. Code for this work can be found at

What problem does this paper attempt to address?

This paper focuses on algorithm design for reinforcement learning from human feedback (RLHF), which fine-tunes a generative model using human preference feedback in the absence of an explicit reward function. The current RLHF framework typically consists of two steps: learning a reward model from offline preference data, and then running online RL to optimize the policy against that learned reward model. The paper proposes a new algorithm, "Dataset Reset Policy Optimization" (DR-PO), which uses the concept of "reset" to integrate the offline data directly into online policy training: instead of always starting from the initial state distribution, the policy optimizer is reset to states that appear in the offline data and explores from there. In theory, DR-PO is guaranteed to learn a policy at least as good as any policy covered by the offline data, with finite sample complexity. In experiments, DR-PO produces better generations than Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) on the TL;DR summarization and Anthropic HH datasets, with GPT-4 win rate as the evaluation metric. Furthermore, DR-PO scales comparably to PPO across models of different sizes while still surpassing the baseline methods.

The main contributions of the paper are as follows:
1. Proposing a new algorithm, DR-PO, that leverages reset to integrate offline data into online RLHF, with theoretical performance guarantees and computational tractability.
2. Showing experimentally that DR-PO surpasses PPO and DPO on standard RLHF benchmark tasks, and that it outperforms PPO without additional computational or memory overhead.
3. Investigating how to utilize reset to optimize RLHF algorithms without relying on offline data, expanding the theoretical work in related fields.

In conclusion, this paper addresses how to use offline preference data effectively during online policy optimization in RLHF, and validates the effectiveness and advantages of the new algorithm on practical tasks.
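The dataset reset idea described above can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that a state is a prompt followed by a partial response, that a "reset" means starting generation from a random prefix of a labeler-preferred response, and that resets are mixed with ordinary starts using a probability. The function name `sample_reset_state` and the parameter `reset_prob` are hypothetical names introduced only for this illustration.

```python
import random
from typing import List, Sequence


def sample_reset_state(
    prompt: Sequence[int],
    preferred_response: Sequence[int],
    reset_prob: float,
    rng: random.Random,
) -> List[int]:
    """Pick the token sequence from which the policy starts generating.

    Assumption (not from the source): with probability `reset_prob` we reset
    to a state taken from the offline dataset, i.e. the prompt followed by a
    random-length prefix of the preferred response. Otherwise we start from
    the initial state, which is the prompt alone.
    """
    # Ordinary start: the initial state distribution (prompt only).
    if rng.random() >= reset_prob or not preferred_response:
        return list(prompt)

    # Dataset reset: keep a prefix of the preferred response.
    # randint is inclusive on both ends, so the cut can be 0 (prompt only)
    # or len(preferred_response) (the full preferred response).
    cut = rng.randint(0, len(preferred_response))
    return list(prompt) + list(preferred_response[:cut])


if __name__ == "__main__":
    rng = random.Random(0)
    prompt_tokens = [101, 102, 103]            # toy token ids
    preferred_tokens = [201, 202, 203, 204]    # toy preferred response
    for _ in range(3):
        state = sample_reset_state(prompt_tokens, preferred_tokens, 0.5, rng)
        print(state)
```

In an actual training loop, the policy would continue generating from the returned state, and the resulting completion would be scored by the learned reward model. Those steps are omitted here because the text above does not specify them.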

Dataset Reset Policy Optimization for RLHF

Beyond Reward: Offline Preference-guided Policy Optimization

Behavior Proximal Policy Optimization

DROP: Conservative Model-based Optimization for Offline Reinforcement Learning

DPO Meets PPO: Reinforced Token Optimization for RLHF

Policy Optimization in RLHF: The Impact of Out-of-preference Data

Exploration-Driven Policy Optimization in RLHF: Theoretical Insights on Efficient Data Utilization

Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint

CROP: Conservative Reward for Model-based Offline Policy Optimization

Boosting Offline Reinforcement Learning via Data Rebalancing

MOPO: Model-based Offline Policy Optimization

Self-Improving Robust Preference Optimization

Optimistic Model Rollouts for Pessimistic Offline Policy Optimization

REBEL: Reinforcement Learning via Regressing Relative Rewards

Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF

Reinforcement Learning Driven Heuristic Optimization

Model-Based Offline Adaptive Policy Optimization with Episodic Memory

Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism

Offline Retraining for Online RL: Decoupled Policy Learning to Mitigate Exploration Bias

PROTO: Iterative Policy Regularized Offline-to-Online Reinforcement Learning