Abstract: This study addresses offline preference-based reinforcement learning (PbRL), a variant of conventional reinforcement learning that requires neither online interaction nor a specified reward function. Instead, the agent is given fixed offline trajectories, from which it extracts the dynamics, and human preferences between pairs of trajectories, from which it extracts the task information. Since the dynamics and task information are orthogonal, a naive approach would apply preference-based reward learning followed by an off-the-shelf offline RL algorithm. However, this requires separately learning a scalar reward function, which is assumed to be an information bottleneck in the learning process. To address this issue, we propose the offline preference-guided policy optimization (OPPO) paradigm, which models offline trajectories and preferences in a one-step process, eliminating the need to learn a reward function separately. OPPO achieves this by introducing an offline hindsight information matching objective for optimizing a contextual policy and a preference modeling objective for finding the optimal context. By optimizing the two objectives iteratively, OPPO further integrates a well-performing decision policy. Our empirical results demonstrate that OPPO effectively models offline preferences and outperforms prior competing baselines, including offline RL algorithms run with either true or pseudo reward function specifications. Our code is available on the project website: https://sites.google.com/view/oppo-icml-2023 .
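To make the naive two-step baseline mentioned in the abstract concrete, the following is a minimal sketch of the kind of loss often used for preference-based reward learning. It assumes a Bradley-Terry preference model over summed per-step rewards, which the abstract does not specify; the function names `softplus` and `preference_loss` and their inputs are hypothetical and chosen only for illustration. It is not the OPPO objective, which avoids learning a separate reward function.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def preference_loss(rewards_a, rewards_b, label):
    """Negative log-likelihood of one labeled trajectory pair.

    Assumes a Bradley-Terry model (an assumption, not taken from the abstract):
        P(A preferred over B) = sigmoid(sum(rewards_a) - sum(rewards_b))

    rewards_a, rewards_b: predicted per-step rewards for each trajectory.
    label: 1.0 if trajectory A is preferred, 0.0 if B is preferred.
    """
    diff = sum(rewards_a) - sum(rewards_b)
    log_p_a = -softplus(-diff)  # log sigmoid(diff)
    log_p_b = -softplus(diff)   # log (1 - sigmoid(diff))
    return -(label * log_p_a + (1.0 - label) * log_p_b)


if __name__ == "__main__":
    # The loss is lower when the preferred trajectory has the higher predicted return.
    print(preference_loss([1.0, 1.0], [0.0, 0.0], 1.0))
    print(preference_loss([1.0, 1.0], [0.0, 0.0], 0.0))
```

In the two-step baseline, a reward model trained with a loss of this kind would relabel the offline trajectories before an off-the-shelf offline RL algorithm is run on them; that scalar reward is the bottleneck the abstract says OPPO is designed to bypass.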

Sequential Classification-Based Optimization for Direct Policy Search

Beyond Reward: Offline Preference-guided Policy Optimization

Successive Convex Approximation Based Off-Policy Optimization for Constrained Reinforcement Learning

Derivative-Free Optimization via Classification

Cautious Bayesian Optimization for Efficient and Scalable Policy Search

A Scalable Derivative-free Exploration Approach for Reinforcement Learning

Quantile-Based Policy Optimization for Reinforcement Learning

Proximal Policy Optimization and Its Dynamic Version for Sequence Generation

Policy Optimization via Importance Sampling

Variance-Reduced Off-Policy Memory-Efficient Policy Search

Asynchronous classification-based optimization

Bayesian Sequential Optimal Experimental Design for Nonlinear Models Using Policy Gradient Reinforcement Learning

Deterministic Policy Optimization by Combining Pathwise and Score Function Estimators for Discrete Action Spaces

Monte Carlo Tree Search for Policy Optimization

Reinforcement Learning Driven Heuristic Optimization

Policy Optimization by Genetic Distillation

Proximal Policy Optimization Algorithms

Optimistic Natural Policy Gradient: a Simple Efficient Policy Optimization Framework for Online RL

Trajectory-Oriented Policy Optimization with Sparse Rewards

Decentralized Policy Optimization

Direct Random Search for Fine Tuning of Deep Reinforcement Learning Policies