Abstract: Offline reinforcement learning (RL) is a data-driven learning paradigm for sequential decision making. A core challenge in offline RL is mitigating the overestimation of values at out-of-distribution (OOD) states, which arises from the distribution shift between the learning policy and the previously collected offline dataset. To tackle this problem, some methods underestimate the values of states generated by learned dynamics models, or of state-action pairs whose actions are sampled from policies different from the behavior policy. However, since these generated states or state-action pairs are not guaranteed to be OOD, staying conservative on them may adversely affect the in-distribution ones. In this paper, we propose an OOD state-conservative offline RL method (OSCAR) that addresses this limitation by explicitly generating reliable OOD states located near the manifold of the offline dataset. We then design a conservative policy evaluation approach that combines the vanilla Bellman error with a regularization term that underestimates the values of only these generated OOD states. In this way, we prevent the value errors of OOD states from propagating to in-distribution states through value bootstrapping and policy improvement. We also prove theoretically that the proposed conservative policy evaluation approach is guaranteed to underestimate the values of OOD states. We evaluate OSCAR, along with several strong baselines, on the offline decision-making benchmark D4RL and the autonomous driving benchmark SMARTS. Experimental results show that OSCAR outperforms the baselines on a large portion of the benchmarks and attains the highest average return among the compared methods.
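
The general shape of such an objective, a Bellman error on dataset transitions plus a penalty applied only to generated OOD states, can be illustrated with a minimal sketch. This is not the OSCAR implementation: the linear value function, the Gaussian perturbation used as a stand-in OOD-state generator, and the names `conservative_value_loss`, `lam`, and `ood_states` are all assumptions made for illustration.

```python
import numpy as np


def conservative_value_loss(w, states, rewards, next_states, ood_states,
                            gamma=0.99, lam=1.0):
    """Squared Bellman error on dataset transitions plus an OOD value penalty.

    Assumes a linear state-value function V(s) = s @ w (illustrative only).
    Returns (total, bellman, ood_penalty).
    """
    v = states @ w
    # Bootstrapped target; in a gradient-based learner this would be held fixed.
    target = rewards + gamma * (next_states @ w)
    bellman = np.mean((v - target) ** 2)
    # Minimizing the mean value at OOD states pushes those values down,
    # while in-distribution states are touched only by the Bellman term.
    ood_penalty = np.mean(ood_states @ w)
    return bellman + lam * ood_penalty, bellman, ood_penalty


rng = np.random.default_rng(0)
states = rng.normal(size=(64, 4))
next_states = states + 0.1 * rng.normal(size=(64, 4))
rewards = rng.normal(size=64)

# Hypothetical OOD generator: perturb dataset states so the samples stay
# close to the data. The paper's actual generator is not specified here.
ood_states = states + 0.5 * rng.normal(size=(64, 4))

w = rng.normal(size=4)
total, bellman, penalty = conservative_value_loss(
    w, states, rewards, next_states, ood_states
)
```

In this sketch, `lam` sets how strongly OOD values are pushed down; with `lam=0.0` the objective reduces to the vanilla Bellman error.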

ORAD: a New Framework of Offline Reinforcement Learning with Q-value Regularization

DROP: Conservative Model-based Optimization for Offline Reinforcement Learning

Beyond Reward: Offline Preference-guided Policy Optimization

DARA: Dynamics-Aware Reward Augmentation in Offline Reinforcement Learning

A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning

OSCAR: OOD State-Conservative Offline Reinforcement Learning for Sequential Decision Making

Adaptive Policy Learning for Offline-to-Online Reinforcement Learning

Robust Offline Reinforcement Learning from Low-Quality Data

Offline Reinforcement Learning With Behavior Value Regularization

Efficient Offline Reinforcement Learning With Relaxed Conservatism

Offline RL with No OOD Actions: In-Sample Learning Via Implicit Value Regularization

Interpretable performance analysis towards offline reinforcement learning: A dataset perspective

Augmenting Offline RL with Unlabeled Data

Believe What You See: Implicit Constraint Approach for Offline Multi-Agent Reinforcement Learning

Mildly Conservative Q-Learning for Offline Reinforcement Learning

Value Function Evaluation with Data Augmentation for Offline Reinforcement Learning

ACL-QL: Adaptive Conservative Level in Q-Learning for Offline Reinforcement Learning

PROTO: Iterative Policy Regularized Offline-to-Online Reinforcement Learning

Real World Offline Reinforcement Learning with Realistic Data Source

Challenges and Opportunities in Offline Reinforcement Learning from Visual Observations

Adaptable Conservative Q-Learning for Offline Reinforcement Learning