Abstract: Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data. This problem setting offers the promise of utilizing such datasets to acquire policies without any costly or dangerous active exploration. However, it is also challenging, due to the distributional shift between the offline training data and the states visited by the learned policy. Despite significant recent progress, the most successful prior methods are model-free and constrain the policy to the support of the data, precluding generalization to unseen states. In this paper, we first observe that an existing model-based RL algorithm already produces significant gains in the offline setting compared to model-free approaches. However, standard model-based RL methods, designed for the online setting, do not provide an explicit mechanism to avoid the offline setting's distributional shift issue. Instead, we propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics. We theoretically show that the algorithm maximizes a lower bound of the policy's return under the true MDP. We also characterize the trade-off between the gain and risk of leaving the support of the batch data. Our algorithm, Model-based Offline Policy Optimization (MOPO), outperforms standard model-based RL algorithms and prior state-of-the-art model-free offline RL algorithms on existing offline RL benchmarks and two challenging continuous control tasks that require generalizing from data collected for a different task. The code is available at https://github.com/tianheyu927/mopo.
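The reward penalty described in the abstract can be illustrated with a minimal sketch. It assumes the penalized reward takes the form r - lambda * u(s, a), and it assumes the uncertainty u is estimated from the disagreement among an ensemble of learned dynamics models. The abstract does not specify the estimator, so `ensemble_disagreement`, `penalized_reward`, and `penalty_coef` are hypothetical names chosen for illustration and are not taken from the released code.

```python
from statistics import pstdev
from typing import Sequence


def ensemble_disagreement(predictions: Sequence[Sequence[float]]) -> float:
    """Hypothetical uncertainty estimate (an assumption, not from the abstract).

    Returns the largest per-dimension population standard deviation across
    the ensemble members' next-state predictions.
    """
    dims = range(len(predictions[0]))
    return max(pstdev([p[d] for p in predictions]) for d in dims)


def penalized_reward(reward: float, uncertainty: float,
                     penalty_coef: float = 1.0) -> float:
    """Reward penalized by model uncertainty: r - penalty_coef * u."""
    return reward - penalty_coef * uncertainty


# Three hypothetical ensemble members predicting a 2-D next state.
# Members agree: the state-action pair resembles the batch data.
agree = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
# Members disagree: the pair is likely far from the batch data.
disagree = [[1.0, 2.0], [1.0, 5.0], [1.0, 8.0]]

r_in = penalized_reward(1.0, ensemble_disagreement(agree))
r_out = penalized_reward(1.0, ensemble_disagreement(disagree))
print(r_in, r_out)
```

With the same raw reward, the transition on which the models disagree receives a lower penalized reward, so a policy optimized against these rewards is discouraged from relying on regions where the learned dynamics are unreliable. The coefficient `penalty_coef` controls the trade-off between the gain and the risk of leaving the support of the data.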

Offline Policy Reuse-Guided Anytime Online Collective Multiagent Planning and Its Application to Mobility-on-demand Systems

Learning to Cooperate: Application of Deep Reinforcement Learning for Online AGV Path Finding

Optimal Control-Based Online Motion Planning for Cooperative Lane Changes of Connected and Automated Vehicles

Scalable Model-based Policy Optimization for Decentralized Networked Systems

A Policy-Driven Multi-Agent System for OGSA-Compliant Grid Control

Multi-agent policy learning-based path planning for autonomous mobile robots

Deep Reinforcement Learning Based Computation Offloading in Heterogeneous MEC Assisted by Ground Vehicles and Unmanned Aerial Vehicles

Hybrid Heuristic Online Planning for POMDPs

Communication-Efficient Cooperative Multi-Agent PPO via Regulated Segment Mixture in Internet of Vehicles

Multi-Objective Multi-Agent Planning for Discovering and Tracking Multiple Mobile Objects

Online Planning in POMDPs with State-Requests

MOPO: Model-based Offline Policy Optimization

Adaptive Online Packing-guided Search for POMDPs

Multi-Agent Soft Actor-Critic with Global Loss for Autonomous Mobility-on-Demand Fleet Control

Offline Multi-Agent Reinforcement Learning via In-Sample Sequential Policy Optimization

B2MAPO: A Batch-by-Batch Multi-Agent Policy Optimization to Balance Performance and Efficiency

Optimal Multilayered Motion Planning for Multiple Differential Drive Mobile Robots with Hierarchical Prioritization (OM-MP)

The state of marriage: contemporary marriage.

Learning-based Online Optimization for Autonomous Mobility-on-Demand Fleet Control

Lifelong Path Planning with Kinematic Constraints for Multi-Agent Pickup and Delivery