Abstract: This study addresses offline preference-based reinforcement learning (PbRL), a variant of conventional reinforcement learning that requires neither online interaction nor a specified reward function. Instead, the agent is given fixed offline trajectories, from which it extracts the dynamics information, and human preferences between pairs of trajectories, from which it extracts the task information. Since the dynamics and task information are orthogonal, a naive approach would apply preference-based reward learning followed by an off-the-shelf offline RL algorithm. However, this requires separately learning a scalar reward function, which is assumed to be an information bottleneck of the learning process. To address this issue, we propose the offline preference-guided policy optimization (OPPO) paradigm, which models offline trajectories and preferences in a one-step process, eliminating the need to learn a reward function separately. OPPO achieves this by introducing an offline hindsight information matching objective for optimizing a contextual policy and a preference modeling objective for finding the optimal context. OPPO further integrates a well-performing decision policy by optimizing the two objectives iteratively. Our empirical results demonstrate that OPPO effectively models offline preferences and outperforms prior competing baselines, including offline RL algorithms run with either true or pseudo reward function specifications. Our code is available on the project website: https://sites.google.com/view/oppo-icml-2023.
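For context, the first stage of the naive two-step approach described in the abstract (preference-based reward learning) can be illustrated with a minimal sketch. It assumes a Bradley-Terry preference model over summed per-step rewards, which is a common choice in the PbRL literature but is not stated in the abstract. The names `trajectory_return`, `preference_probability`, and `preference_loss`, as well as the toy reward function, are hypothetical and chosen only for illustration; this is not the OPPO method, which avoids learning such a reward function.

```python
import math


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def trajectory_return(reward_fn, trajectory):
    """Sum a per-step reward estimate over (state, action) pairs."""
    return sum(reward_fn(s, a) for s, a in trajectory)


def preference_probability(reward_fn, traj_a, traj_b):
    """Assumed Bradley-Terry model: P(traj_a is preferred over traj_b)."""
    diff = trajectory_return(reward_fn, traj_a) - trajectory_return(reward_fn, traj_b)
    return _sigmoid(diff)


def preference_loss(reward_fn, traj_a, traj_b, label):
    """Cross-entropy loss; label is 1.0 if traj_a is preferred, 0.0 otherwise."""
    p = preference_probability(reward_fn, traj_a, traj_b)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))


# Toy usage with a hypothetical hand-written reward estimate.
toy_reward = lambda s, a: s * a
traj_a = [(1.0, 1.0), (2.0, 1.0)]  # estimated return 3.0
traj_b = [(0.5, 1.0)]              # estimated return 0.5
p_a_over_b = preference_probability(toy_reward, traj_a, traj_b)
```

In a full pipeline under these assumptions, `reward_fn` would be a parametric model trained by minimizing `preference_loss` over labeled trajectory pairs, and the learned reward would then be passed to an offline RL algorithm. This separately learned scalar reward is the step the abstract identifies as a potential information bottleneck.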

Efficient Preference-Based Reinforcement Learning Using Learned Dynamics Models

Beyond Reward: Offline Preference-guided Policy Optimization

STRAPPER: Preference-based Reinforcement Learning via Self-training Augmentation and Peer Regularization

Sample-Efficient Preference-based Reinforcement Learning with Dynamics Aware Rewards

Direct Preference-based Policy Optimization without Reward Modeling

On-Robot Bayesian Reinforcement Learning for POMDPs

RIME: Robust Preference-based Reinforcement Learning with Noisy Preferences

Meta-Reward-Net: Implicitly Differentiable Reward Learning for Preference-based Reinforcement Learning

Provable Reward-Agnostic Preference-Based Reinforcement Learning

Preference-based Reinforcement Learning with Finite-Time Guarantees

Hybrid Reinforcement Learning Based on Human Preference and Advice for Efficient Robot Skill Learning

Sample-Efficient Reinforcement Learning Based on Dynamics Models via Meta-policy Optimization

Online Policy Learning from Offline Preferences

Beyond Human Preferences: Exploring Reinforcement Learning Trajectory Evaluation and Improvement through LLMs

Learning Dynamics Models for Model Predictive Agents

Efficient Preference-based Reinforcement Learning via Aligned Experience Estimation

Personalization in Human-Robot Interaction through Preference-based Action Representation Learning

Advances in Preference-based Reinforcement Learning: A Review

PrefCLM: Enhancing Preference-based Reinforcement Learning with Crowdsourced Large Language Models

Live in the Moment: Learning Dynamics Model Adapted to Evolving Policy