Abstract: This study addresses offline preference-based reinforcement learning (PbRL), a variant of conventional reinforcement learning that requires neither online interaction nor a specified reward function. Instead, the agent is given fixed offline trajectories, from which it extracts the dynamics, and human preferences between pairs of trajectories, from which it extracts the task information. Since the dynamics and task information are orthogonal, a naive approach would apply preference-based reward learning followed by an off-the-shelf offline RL algorithm. However, this requires separately learning a scalar reward function, which is assumed to be an information bottleneck in the learning process. To address this issue, we propose the offline preference-guided policy optimization (OPPO) paradigm, which models offline trajectories and preferences in a one-step process, eliminating the need to learn a reward function separately. OPPO achieves this by introducing an offline hindsight information matching objective for optimizing a contextual policy and a preference modeling objective for finding the optimal context. By optimizing the two objectives iteratively, OPPO further yields a well-performing decision policy. Our empirical results demonstrate that OPPO effectively models offline preferences and outperforms prior competing baselines, including offline RL algorithms run with either true or pseudo reward function specifications. Our code is available on the project website: https://sites.google.com/view/oppo-icml-2023.
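
To make the naive two-step baseline concrete, the following is a minimal sketch of the preference-based reward learning step that OPPO is designed to avoid. The abstract does not state the form of this loss; the sketch assumes a Bradley-Terry model over summed per-step rewards, a common choice in the PbRL literature. The function names `preference_probability` and `preference_loss` are hypothetical and do not come from the OPPO codebase.

```python
import math
from typing import Sequence


def preference_probability(rewards_a: Sequence[float],
                           rewards_b: Sequence[float]) -> float:
    """Probability that trajectory A is preferred over trajectory B.

    Assumption: a Bradley-Terry model applied to the sum of predicted
    per-step rewards of each trajectory.
    """
    diff = sum(rewards_a) - sum(rewards_b)
    # Logistic function of the return difference, written to avoid overflow.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def preference_loss(rewards_a: Sequence[float],
                    rewards_b: Sequence[float],
                    label: float) -> float:
    """Cross-entropy between the human label and the modeled preference.

    label is 1.0 if A is preferred, 0.0 if B is preferred, 0.5 for a tie.
    """
    p = preference_probability(rewards_a, rewards_b)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps)
             + (1.0 - label) * math.log(1.0 - p + eps))
```

In the naive pipeline, a reward model would be trained to minimize this loss over all labeled trajectory pairs, and its scalar outputs would then be passed to an offline RL algorithm. OPPO, as described above, instead applies its preference modeling objective to find an optimal context for a contextual policy, so no scalar reward is learned as an intermediate step.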

Proximal policy optimization with model-based methods

Policy Optimization with Model-based Explorations

Behavior Proximal Policy Optimization

Beyond Reward: Offline Preference-guided Policy Optimization

Model-Based Reinforcement Learning via Proximal Policy Optimization

Beyond the Boundaries of Proximal Policy Optimization

Proximal Policy Optimization Algorithms

Truly Proximal Policy Optimization

Proximal policy optimization via enhanced exploration efficiency

Fast-PPO: Proximal Policy Optimization with Optimal Baseline Method

PPO-CMA: Proximal Policy Optimization with Covariance Matrix Adaptation

Proximal Policy Optimization with Mixed Distributed Training

Proximal Policy Optimization Smoothed Algorithm

A Theoretical Analysis of Optimistic Proximal Policy Optimization in Linear Markov Decision Processes

Fast Proximal Policy Optimization

Pairwise Proximal Policy Optimization: Harnessing Relative Feedback for LLM Alignment

Transductive Off-policy Proximal Policy Optimization

Deep Model-Based Reinforcement Learning via Estimated Uncertainty and Conservative Policy Optimization

Bidirectional Model-based Policy Optimization

Authentic Boundary Proximal Policy Optimization

The Surprising Effectiveness of PPO in Cooperative, Multi-Agent Games