Abstract: It is desirable for policies to optimistically explore new states and behaviors during online reinforcement learning (RL) or fine-tuning, especially when prior offline data does not provide enough state coverage. However, exploration bonuses can bias the learned policy, and our experiments find that the naive yet standard use of such bonuses can fail to recover a performant policy. Concurrently, pessimistic training in offline RL has enabled recovery of performant policies from static datasets. Can we leverage offline RL to recover better policies from online interaction? We make the simple observation that a policy can be trained from scratch on all interaction data with pessimistic objectives, thereby decoupling the policies used for data collection and for evaluation. Specifically, we propose offline retraining, a policy extraction step at the end of online fine-tuning in our Offline-to-Online-to-Offline (OOO) framework for RL. An optimistic (exploration) policy is used to interact with the environment, and a separate pessimistic (exploitation) policy is trained on all the observed data for evaluation. Such decoupling can reduce bias from online interaction (intrinsic rewards, primacy bias) in the evaluation policy, and can allow more exploratory behaviors during online interaction, which in turn can generate better data for exploitation. OOO is complementary to several offline-to-online RL and online RL methods: it improves their average performance by 14% to 26% in our fine-tuning experiments, achieves state-of-the-art performance on several environments in the D4RL benchmarks, and improves online RL performance by 165% on two OpenAI gym environments. Further, OOO can enable fine-tuning from incomplete offline datasets where prior methods can fail to recover a performant policy. Implementation: https://github.com/MaxSobolMark/OOO
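The decoupling described in the abstract can be illustrated with a deliberately tiny, single-state sketch. This is not the authors' implementation (see the repository linked above); the three-action problem, the count-based bonus `BETA / sqrt(n)`, the matching penalty, and all function names are assumptions chosen only to make the idea concrete. An optimistic rule collects data using reward plus bonus, and a separate rule is then fit from scratch on the logged extrinsic rewards with a pessimistic penalty.

```python
import math
from collections import defaultdict

# Hypothetical toy problem: one state, three actions, fixed extrinsic rewards.
TRUE_REWARD = [0.2, 0.5, 0.8]
BETA = 1.0  # assumed scale shared by the exploration bonus and the penalty


def fit_statistics(buffer):
    """Recompute per-action visit counts and mean extrinsic reward from data."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for action, reward in buffer:
        counts[action] += 1
        totals[action] += reward
    means = {a: totals[a] / counts[a] for a in counts}
    return counts, means


def optimistic_action(buffer):
    """Exploration policy: mean reward PLUS a count-based bonus."""
    counts, means = fit_statistics(buffer)

    def value(a):
        if counts[a] == 0:
            return float("inf")  # untried actions look maximally attractive
        return means[a] + BETA / math.sqrt(counts[a])

    return max(range(len(TRUE_REWARD)), key=value)


def pessimistic_action(buffer):
    """Evaluation policy: mean reward MINUS a count-based penalty."""
    counts, means = fit_statistics(buffer)

    def value(a):
        if counts[a] == 0:
            return float("-inf")  # never trust actions absent from the data
        return means[a] - BETA / math.sqrt(counts[a])

    return max(range(len(TRUE_REWARD)), key=value)


def collect_online(num_steps):
    """Online phase: only the optimistic policy interacts with the environment."""
    buffer = []  # stores (action, extrinsic_reward); the bonus is never logged
    for _ in range(num_steps):
        action = optimistic_action(buffer)
        buffer.append((action, TRUE_REWARD[action]))
    return buffer


# Online exploration, then offline retraining from scratch on all logged data.
buffer = collect_online(num_steps=300)
evaluation_action = pessimistic_action(buffer)
print("actions tried:", sorted({a for a, _ in buffer}))
print("evaluation policy picks action", evaluation_action)
```

Because the evaluation rule is refit only on logged extrinsic rewards, the bonus used during collection never enters its objective. In a real setting both rules would be learned policies and value functions rather than per-action averages.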

Uncertainty-based Bootstrapped Optimization for Offline Reinforcement Learning

DROP: Conservative Model-based Optimization for Offline Reinforcement Learning

Behavior Proximal Policy Optimization

Beyond Reward: Offline Preference-guided Policy Optimization

Towards Robust Offline-to-Online Reinforcement Learning via Uncertainty and Smoothness

Uncertainty-Aware Data Augmentation for Offline Reinforcement Learning

UAC: Offline Reinforcement Learning with Uncertain Action Constraint

Uncertainty-aware Distributional Offline Reinforcement Learning

Offline-Boosted Actor-Critic: Adaptively Blending Optimal Historical Behaviors in Deep Off-Policy RL

Diverse Randomized Value Functions: A Provably Pessimistic Approach for Offline Reinforcement Learning

A Simple Unified Uncertainty-Guided Framework for Offline-to-Online Reinforcement Learning

Enhancing OOD Generalization in Offline Reinforcement Learning with Energy-Based Policy Optimization

Planning to Go Out-of-Distribution in Offline-to-Online Reinforcement Learning

MUSBO: Model-based Uncertainty Regularized and Sample Efficient Batch Optimization for Deployment Constrained Reinforcement Learning

Towards Robust Off-Policy Learning for Runtime Uncertainty

DARL: Distance-Aware Uncertainty Estimation for Offline Reinforcement Learning

Robust Offline Reinforcement Learning from Low-Quality Data

Bayesian Design Principles for Offline-to-Online Reinforcement Learning

Deterministic Uncertainty Propagation for Improved Model-Based Offline Reinforcement Learning

Offline Retraining for Online RL: Decoupled Policy Learning to Mitigate Exploration Bias

Uncertainty-driven Trajectory Truncation for Data Augmentation in Offline Reinforcement Learning