Abstract: This study addresses offline preference-based reinforcement learning (PbRL), a variant of conventional reinforcement learning that requires neither online interaction nor a specified reward function. Instead, the agent is given fixed offline trajectories, from which it extracts dynamics information, and human preferences between pairs of trajectories, from which it extracts task information. Since the dynamics and task information are orthogonal, a naive approach would apply preference-based reward learning followed by an off-the-shelf offline RL algorithm. However, this requires separately learning a scalar reward function, which we assume to be an information bottleneck in the learning process. To address this issue, we propose the offline preference-guided policy optimization (OPPO) paradigm, which models offline trajectories and preferences in a single step and eliminates the need to learn a reward function separately. OPPO achieves this by introducing an offline hindsight information matching objective for optimizing a contextual policy and a preference modeling objective for finding the optimal context. By optimizing the two objectives iteratively, OPPO further integrates a well-performing decision policy. Our empirical results demonstrate that OPPO effectively models offline preferences and outperforms prior competing baselines, including offline RL algorithms run with either true or pseudo reward function specifications. Our code is available on the project website: https://sites.google.com/view/oppo-icml-2023
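
For illustration only, the reward-learning step of the naive two-stage baseline mentioned in the abstract (not OPPO itself, which avoids a separate reward function) might be sketched as follows. This assumes the commonly used Bradley-Terry preference model, which the abstract does not name; the function `preference_loss` and its argument names are hypothetical and do not come from the authors' code.

```python
import numpy as np


def preference_loss(returns_a, returns_b, prefers_a):
    """Negative log-likelihood of preference labels under a Bradley-Terry model.

    Assumption (not stated in the abstract): the probability that trajectory a
    is preferred over trajectory b is sigmoid(R_a - R_b), where R is the sum of
    predicted per-step rewards along a trajectory.

    returns_a, returns_b: predicted summed rewards for each trajectory in a pair
    prefers_a: 1.0 where the annotator preferred trajectory a, 0.0 otherwise
    """
    returns_a = np.asarray(returns_a, dtype=float)
    returns_b = np.asarray(returns_b, dtype=float)
    prefers_a = np.asarray(prefers_a, dtype=float)

    diff = returns_a - returns_b
    # log sigmoid(x) = -log(1 + exp(-x)), computed stably with logaddexp
    log_p_a = -np.logaddexp(0.0, -diff)
    log_p_b = -np.logaddexp(0.0, diff)

    # Cross-entropy between the preference labels and the predicted probabilities
    return float(-np.mean(prefers_a * log_p_a + (1.0 - prefers_a) * log_p_b))
```

Minimizing a loss of this kind over a reward model's parameters yields the scalar reward that an off-the-shelf offline RL algorithm would then consume; this intermediate step is the one the abstract describes as a potential information bottleneck.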

Semi-Offline Reinforcement Learning for Optimized Text Generation

Beyond Reward: Offline Preference-guided Policy Optimization

DROP: Conservative Model-based Optimization for Offline Reinforcement Learning

Offline RL for Natural Language Generation with Implicit Language Q Learning

Towards Data-Driven Offline Simulations for Online Reinforcement Learning

Adaptive Policy Learning for Offline-to-Online Reinforcement Learning

Efficient Online Reinforcement Learning with Offline Data

Semi-supervised reward learning for offline reinforcement learning

A Survey on Offline Reinforcement Learning: Taxonomy, Review, and Open Problems

Offline Deep Reinforcement Learning Two-stage Optimization Framework Applied to Recommendation Systems

Improving Offline Reinforcement Learning with Inaccurate Simulators

Boosting Offline Reinforcement Learning with Residual Generative Modeling

Deploying Offline Reinforcement Learning with Human Feedback

Reward-agnostic Fine-tuning: Provable Statistical Benefits of Hybrid Reinforcement Learning

Bayesian Design Principles for Offline-to-Online Reinforcement Learning

Leveraging Offline Data in Online Reinforcement Learning

Text Generation by Learning from Demonstrations

A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning

Finetuning from Offline Reinforcement Learning: Challenges, Trade-offs and Practical Solutions

Hundreds Guide Millions: Adaptive Offline Reinforcement Learning with Expert Guidance

Unsupervised-to-Online Reinforcement Learning