Dataset Reset Policy Optimization for RLHF

Jonathan D. Chang, Wenhao Zhan, Owen Oertell, Kianté Brantley, Dipendra Misra, Jason D. Lee, Wen Sun
2024-04-17
Abstract: Reinforcement Learning (RL) from human preference-based feedback is a popular paradigm for fine-tuning generative models, and has produced impressive models such as GPT-4 and Claude 3 Opus. This framework often consists of two steps: learning a reward model from an offline preference dataset, followed by running online RL to optimize the learned reward model. In this work, leveraging the idea of reset, we propose a new RLHF algorithm with provable guarantees. Motivated by the fact that an offline preference dataset provides informative states (i.e., data that is preferred by the labelers), our new algorithm, Dataset Reset Policy Optimization (DR-PO), integrates the existing offline preference dataset into the online policy training procedure via dataset reset: it directly resets the policy optimizer to the states in the offline dataset, instead of always starting from the initial state distribution. In theory, we show that DR-PO learns to perform at least as well as any policy that is covered by the offline dataset under general function approximation with finite sample complexity. In experiments, we demonstrate that on both the TL;DR summarization and the Anthropic Helpful and Harmless (HH) datasets, the generation from DR-PO is better than that from Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), under the metric of GPT-4 win rate. Code for this work can be found at
Machine Learning, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper focuses on improving algorithms for reinforcement learning from human feedback (RLHF), which fine-tunes a generative model using human preference feedback in the absence of an explicit reward function. The current RLHF framework typically consists of two steps: learning a reward model from offline preference data, and then running online RL to optimize the policy against that learned reward model.

The paper proposes a new algorithm called "Dataset Reset Policy Optimization" (DR-PO), which uses the concept of "reset" to integrate the offline data directly into online policy training: the policy optimizer is reset to states drawn from the offline data, rather than always starting from the initial state distribution. In theory, DR-PO is guaranteed to learn a policy that performs at least as well as any policy covered by the offline data, with finite sample complexity. In experiments, DR-PO outperforms Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) in generation quality on the TL;DR summarization and Anthropic HH datasets, with GPT-4 win rate as the evaluation metric. DR-PO also scales comparably to PPO across model sizes while still surpassing the baseline methods.

The main contributions of the paper are as follows:
1. Proposing a new algorithm, DR-PO, that leverages reset to integrate offline data into online RLHF, has theoretical performance guarantees, and is computationally tractable.
2. Showing experimentally that DR-PO surpasses PPO and DPO on standard RLHF benchmark tasks, and improves on PPO without additional computational or memory overhead.
3. Investigating how to utilize reset to optimize RLHF algorithms without relying on offline data, expanding the theoretical work in related fields.
In conclusion, this paper addresses how to make effective use of offline preference data during online policy optimization in RLHF, and validates the effectiveness and advantages of the new algorithm on standard RLHF benchmarks.
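The dataset reset idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `sample_start_state`, the mixing parameter `reset_prob`, and the choice of a uniformly random prefix of the preferred response are all assumptions made for illustration, since this summary does not specify how DR-PO selects reset states. The sketch only shows the core contrast: a rollout may begin from a state taken from the offline preference data instead of always beginning from the prompt.

```python
import random
from typing import List, Sequence, Tuple


def sample_start_state(
    prompt: Sequence[int],
    preferred_response: Sequence[int],
    reset_prob: float,
    rng: random.Random,
) -> Tuple[List[int], bool]:
    """Pick the token sequence from which an online rollout begins.

    With probability `reset_prob` (a hypothetical mixing parameter), reset to
    the prompt plus a random strict prefix of the labeler-preferred response.
    Otherwise start from the prompt alone, i.e. the usual initial state.
    Returns (start_state, was_reset).
    """
    if preferred_response and rng.random() < reset_prob:
        # Strict prefix, so the policy always has at least one token to generate.
        cut = rng.randrange(len(preferred_response))
        return list(prompt) + list(preferred_response[:cut]), True
    return list(prompt), False


if __name__ == "__main__":
    rng = random.Random(0)
    prompt = [101, 7, 8]          # toy token ids for the prompt
    preferred = [21, 22, 23, 24]  # toy token ids for the preferred response
    for _ in range(3):
        state, was_reset = sample_start_state(prompt, preferred, 0.5, rng)
        print(was_reset, state)
```

Under these assumptions, setting `reset_prob` to 0 recovers the standard setup in which every rollout starts from the prompt, while larger values start more rollouts from partial preferred responses. The policy would then continue generating from the returned state and be updated by the online RL optimizer as usual.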