Abstract: This study focuses on offline preference-based reinforcement learning (PbRL), a variant of conventional reinforcement learning that requires neither online interaction nor a specified reward function. Instead, the agent is given fixed offline trajectories, from which it extracts the dynamics, and human preferences between pairs of trajectories, from which it extracts the task information. Since the dynamics and task information are orthogonal, a naive approach would apply preference-based reward learning followed by an off-the-shelf offline RL algorithm. However, this requires separately learning a scalar reward function, which is assumed to be an information bottleneck of the learning process. To address this issue, we propose the offline preference-guided policy optimization (OPPO) paradigm, which models offline trajectories and preferences in a one-step process, eliminating the need to learn a reward function separately. OPPO achieves this by introducing an offline hindsight information matching objective for optimizing a contextual policy and a preference modeling objective for finding the optimal context. OPPO then obtains a well-performing decision policy by optimizing the two objectives iteratively. Our empirical results demonstrate that OPPO effectively models offline preferences and outperforms prior competing baselines, including offline RL algorithms run with either true or pseudo reward function specifications. Our code is available on the project website: https://sites.google.com/view/oppo-icml-2023.
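As a rough illustration of what a preference modeling objective over contexts could look like, the sketch below scores a candidate context against the embeddings of a preferred and a rejected trajectory. The triplet-style hinge form, the `margin` parameter, and the names `l2_distance` and `preference_context_loss` are assumptions made for this example; the abstract does not specify the exact loss.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def preference_context_loss(z_star, z_preferred, z_rejected, margin=1.0):
    """Hypothetical hinge loss over contexts (an assumption, not the paper's stated loss).

    The loss is zero once the candidate context z_star is at least
    `margin` closer to the preferred trajectory's embedding than to
    the rejected one, and grows linearly otherwise.
    """
    gap = l2_distance(z_star, z_preferred) - l2_distance(z_star, z_rejected)
    return max(gap + margin, 0.0)


# Candidate context coincides with the preferred embedding: no penalty.
good = preference_context_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0])
# Candidate context coincides with the rejected embedding: penalized.
bad = preference_context_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0])
```

Under these assumptions, minimizing such a loss with respect to `z_star` would pull the context toward embeddings of preferred trajectories, and a contextual policy conditioned on that context could then be used for decision making.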

Beyond Reward: Offline Preference-guided Policy Optimization
