Abstract: Reinforcement Learning from Human Feedback (RLHF) has achieved impressive empirical successes while relying on a small amount of human feedback. However, there is limited theoretical justification for this phenomenon. Additionally, most recent studies focus on value-based algorithms despite the recent empirical successes of policy-based algorithms. In this work, we consider an RLHF algorithm based on policy optimization (PO-RLHF). The algorithm is based on the popular Policy Cover-Policy Gradient (PC-PG) algorithm, which assumes knowledge of the reward function. In PO-RLHF, knowledge of the reward function is not assumed, and the algorithm uses trajectory-based comparison feedback to infer the reward function. We provide performance bounds for PO-RLHF with low query complexity, which provides insight into why a small amount of human feedback may be sufficient to achieve good performance with RLHF. A key novelty is a trajectory-level elliptical potential analysis, which bounds the reward estimation error when comparison feedback (rather than numerical reward observation) is given. We provide and analyze algorithms PG-RLHF and NN-PG-RLHF for two settings: linear and neural function approximation, respectively.

What problem does this paper attempt to address?

This paper studies exploration-driven policy optimization in reinforcement learning from human feedback (RLHF). In RLHF, the agent does not observe rewards directly; it learns from human preference feedback on trajectories. Although RLHF is highly efficient in practice, its theoretical understanding is limited, and most existing analyses focus on value-based algorithms. The paper proposes a policy-optimization-based RLHF algorithm (PO-RLHF) that does not assume a known reward function and instead infers the reward from trajectory-based comparison feedback. Using a trajectory-level elliptical potential analysis, the paper bounds the reward estimation error when only comparison feedback, rather than numerical rewards, is observed. The authors design two algorithms, PG-RLHF and NN-PG-RLHF, for linear and neural function approximation, respectively. Both algorithms explore the unknown environment and collect human feedback guided by that exploration.

The main contributions of the paper are:

1. A study of policy optimization in RLHF with exploration and active collection of human feedback, together with a theoretical explanation of the practical efficiency of RLHF.
2. Efficient algorithms for the linear and neural function approximation settings.
3. New analysis techniques, including a trajectory-level elliptical potential argument and guarantees for maximum likelihood estimation under neural function approximation.
4. A proof that the amount of human feedback required in RLHF is only a small fraction of the data required in standard reinforcement learning.

Through this work, the paper provides a theoretical foundation for the efficiency of RLHF and insight into how RLHF can succeed with minimal human feedback.
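To make the idea of inferring a reward from trajectory comparisons concrete, the following is a minimal sketch for the linear setting. It is not the paper's PG-RLHF algorithm. It assumes a Bradley-Terry (logistic) preference model, a reward that is linear in per-step features, and plain gradient ascent on a regularized log-likelihood; none of these details are stated in the summary above, and all function names (`sigmoid`, `simulate_comparisons`, `fit_reward_mle`) are hypothetical.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, used here as the assumed preference link."""
    return 1.0 / (1.0 + np.exp(-z))


def simulate_comparisons(theta_true, n, horizon, rng):
    """Draw n synthetic trajectory pairs and preference labels.

    Assumption: the return of a trajectory is theta_true . sum_h phi(s_h, a_h),
    and trajectory A is preferred with probability
    sigmoid(return(A) - return(B)) (Bradley-Terry model).
    """
    d = len(theta_true)
    # Per-step features for both trajectories, scaled so sums stay O(1).
    steps_a = rng.normal(size=(n, horizon, d)) / np.sqrt(horizon)
    steps_b = rng.normal(size=(n, horizon, d)) / np.sqrt(horizon)
    # Trajectory-level feature difference: phi(tau_A) - phi(tau_B).
    diffs = steps_a.sum(axis=1) - steps_b.sum(axis=1)
    # Label 1.0 means trajectory A was preferred.
    labels = (rng.random(n) < sigmoid(diffs @ theta_true)).astype(float)
    return diffs, labels


def fit_reward_mle(diffs, labels, lr=0.5, steps=2000, l2=1e-3):
    """Estimate theta by gradient ascent on the L2-regularized log-likelihood."""
    diffs = np.asarray(diffs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    theta = np.zeros(diffs.shape[1])
    for _ in range(steps):
        p = sigmoid(diffs @ theta)  # predicted P(A preferred)
        grad = diffs.T @ (labels - p) / len(labels) - l2 * theta
        theta += lr * grad
    return theta
```

A possible usage: draw comparisons with `simulate_comparisons(theta_true, n=2000, horizon=5, rng=np.random.default_rng(0))`, pass the result to `fit_reward_mle`, and compare the direction of the estimate with `theta_true`. Only comparison labels, never numerical rewards, reach the estimator, which mirrors the feedback model described above.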

Exploration-Driven Policy Optimization in RLHF: Theoretical Insights on Efficient Data Utilization
