Abstract: Most offline reinforcement learning (RL) algorithms return a target policy maximizing a trade-off between (1) the expected performance gain over the behavior policy that collected the dataset, and (2) the risk stemming from the out-of-distribution-ness of the induced state-action occupancy. It follows that the performance of the target policy is strongly related to the performance of the behavior policy and, thus, to the trajectory return distribution of the dataset. We show that in mixed datasets consisting mostly of low-return trajectories and only a few high-return trajectories, state-of-the-art offline RL algorithms are overly restrained by the low-return trajectories and fail to exploit the high-performing trajectories to the fullest. To overcome this issue, we show that, in deterministic MDPs with stochastic initial states, the dataset sampling can be re-weighted to induce an artificial dataset whose behavior policy has a higher return. This re-weighted sampling strategy may be combined with any offline RL algorithm. We further show that the opportunity for performance improvement over the behavior policy correlates with the positive-sided variance of the returns of the trajectories in the dataset. We empirically show that while CQL, IQL, and TD3+BC achieve only a part of this potential policy improvement, these same algorithms combined with our re-weighted sampling strategy fully exploit the dataset. Furthermore, we empirically demonstrate that, despite its theoretical limitation, the approach may still be efficient in stochastic environments. The code is available at https://github.com/Improbable-AI/harness-offline-rl.
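To make the re-weighting idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a softmax over min-max-normalized trajectory returns with a hypothetical `temperature` parameter, and it assumes one plausible reading of "positive-sided variance" (the mean squared deviation of returns above the dataset mean); the abstract specifies neither. All function names are illustrative.

```python
import math
import random


def trajectory_weights(returns, temperature=0.1):
    """Sampling weights that favor high-return trajectories.

    Assumption: softmax over min-max-normalized returns. The abstract
    only says sampling is re-weighted, not which scheme is used.
    """
    lo, hi = min(returns), max(returns)
    span = hi - lo
    if span == 0:
        # All returns are equal, so fall back to uniform sampling.
        return [1.0 / len(returns)] * len(returns)
    scores = [(g - lo) / span / temperature for g in returns]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def positive_sided_variance(returns):
    """Mean squared deviation of returns above the mean.

    Assumption: an illustrative definition, not taken from the paper.
    """
    mean = sum(returns) / len(returns)
    return sum(max(g - mean, 0.0) ** 2 for g in returns) / len(returns)


def sample_trajectories(trajectories, weights, k, seed=0):
    """Draw k trajectories with replacement according to the weights."""
    rng = random.Random(seed)
    return rng.choices(trajectories, weights=weights, k=k)


# A mixed dataset: nine low-return trajectories and one high-return one.
returns = [1.0] * 9 + [10.0]
weights = trajectory_weights(returns)

# Expected return of the behavior policy under each sampling scheme.
uniform_return = sum(returns) / len(returns)
reweighted_return = sum(w * g for w, g in zip(weights, returns))
```

In this toy dataset the re-weighted sampler draws the single high-return trajectory far more often than uniform sampling would, so the artificial dataset it induces has a higher average return. A batch drawn with `sample_trajectories` could then be passed to any offline RL learner unchanged, which is the sense in which the strategy is algorithm-agnostic.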

TrajDeleter: Enabling Trajectory Forgetting in Offline Reinforcement Learning Agents

DROP: Conservative Model-based Optimization for Offline Reinforcement Learning

Offline Safe Reinforcement Learning Using Trajectory Classification

Prioritized Trajectory Replay: A Replay Memory for Data-driven Reinforcement Learning

Harnessing Mixed Offline Reinforcement Learning Datasets via Trajectory Weighting

ATraDiff: Accelerating Online Reinforcement Learning with Imaginary Trajectories

Uncertainty-driven Trajectory Truncation for Data Augmentation in Offline Reinforcement Learning

In-Dataset Trajectory Return Regularization for Offline Preference-based Reinforcement Learning

Using Offline Data to Speed-up Reinforcement Learning in Procedurally Generated Environments

Adaptive and Multiple Time-scale Eligibility Traces for Online Deep Reinforcement Learning

Learning from Good Trajectories in Offline Multi-Agent Reinforcement Learning

A Trajectory Perspective on the Role of Data Sampling Techniques in Offline Reinforcement Learning

Train Trajectory Optimization with High-Risk State Space Boundaries: A Safe Reinforcement Learning Approach

Efficient Online Reinforcement Learning with Offline Data

Offline Trajectory Generalization for Offline Reinforcement Learning

Optimizing Trajectories for Highway Driving with Offline Reinforcement Learning

Rethinking Trajectory Prediction in Real-World Applications: an Online Task-Free Continual Learning Perspective

TEA: Trajectory Encoding Augmentation for Robust and Transferable Policies in Offline Reinforcement Learning

Model-based Trajectory Stitching for Improved Offline Reinforcement Learning

Efficient Reinforcement Learning Through Trajectory Generation

Active Reinforcement Learning Strategies for Offline Policy Improvement