Bidirectional Model-based Policy Optimization

Hang Lai, Jian Shen, Weinan Zhang, Yong Yu

DOI: https://doi.org/10.48550/arXiv.2007.01995

2020-09-29

Abstract: Model-based reinforcement learning approaches leverage a forward dynamics model to support planning and decision making, which, however, may fail catastrophically if the model is inaccurate. Although there are several existing methods dedicated to combating the model error, the potential of the single forward model is still limited. In this paper, we propose to additionally construct a backward dynamics model to reduce the reliance on accuracy in forward model predictions. We develop a novel method, called Bidirectional Model-based Policy Optimization (BMPO) to utilize both the forward model and backward model to generate short branched rollouts for policy optimization. Furthermore, we theoretically derive a tighter bound of return discrepancy, which shows the superiority of BMPO against the one using merely the forward model. Extensive experiments demonstrate that BMPO outperforms state-of-the-art model-based methods in terms of sample efficiency and asymptotic performance.

Machine Learning, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses error accumulation in multi-step prediction in model-based reinforcement learning (MBRL). Traditional MBRL methods rely on a forward dynamics model to generate simulated trajectories that support planning and decision making. If that model is inaccurate, the result can be catastrophic failure, and the problem is worst in multi-step prediction: each predicted state is fed back in as the input to the next step, so errors compound and prediction quality degrades sharply as the rollout gets longer. Several existing methods aim to reduce model error, but the potential of a single forward model remains limited. To address this, the paper proposes Bidirectional Model-based Policy Optimization (BMPO). BMPO reduces the dependence on forward-model accuracy by additionally constructing a backward dynamics model, and it uses both the forward and backward models to generate short branched rollouts for policy optimization. The paper also derives a tighter theoretical bound on the return discrepancy, which indicates that BMPO has an advantage over methods that use only the forward model. Extensive experiments show that BMPO outperforms existing state-of-the-art model-based methods in both sample efficiency and asymptotic performance.
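The idea of branching a short rollout in both directions from one real state can be illustrated with a minimal Python sketch. It is not the authors' implementation. The toy additive dynamics and the names `forward_model`, `backward_model`, `policy`, `backward_policy`, and `bidirectional_rollout` are hypothetical stand-ins for learned components. The summary above does not say how backward actions are chosen, so the sketch simply assumes a separate backward policy.

```python
import numpy as np

def forward_model(state, action):
    # Hypothetical stand-in for a learned forward dynamics model: predicts s' from (s, a).
    return state + action

def backward_model(next_state, action):
    # Hypothetical stand-in for a learned backward dynamics model: predicts s from (s', a).
    return next_state - action

def policy(state):
    # Toy forward policy (assumption): push the state toward the origin.
    return -0.1 * state

def backward_policy(next_state):
    # Toy backward policy (assumption): guesses the action that led to next_state.
    return -0.1 * next_state

def bidirectional_rollout(start_state, k_forward, k_backward):
    """Branch from a real state: k_backward steps into the past, k_forward into the future."""
    forward = []
    s = start_state
    for _ in range(k_forward):
        a = policy(s)
        s_next = forward_model(s, a)
        forward.append((s, a, s_next))
        s = s_next

    backward = []
    s_next = start_state
    for _ in range(k_backward):
        a = backward_policy(s_next)
        s = backward_model(s_next, a)
        backward.append((s, a, s_next))
        s_next = s

    backward.reverse()  # order transitions from earliest to latest
    return backward + forward

trajectory = bidirectional_rollout(np.array([1.0, -2.0]), k_forward=2, k_backward=3)
print(len(trajectory))  # number of simulated transitions
```

The rollout contains `k_forward + k_backward` transitions, but neither model is applied more than `max(k_forward, k_backward)` times in a row. The intuition is that a trajectory of a given length can be built with fewer consecutive predictions per model, so less error compounds than in a forward-only rollout of the same length.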

Model-Based Robot Learning Control with Uncertainty Directed Exploration

Model-Based Reinforcement Learning via Meta-Policy Optimization

Bidirectional Model-Based Policy Optimization Based on Adaptive Gaussian Noise and Improved Confidence Weights

Dyna-style Model-based reinforcement learning with Model-Free Policy Optimization

Deep Model-Based Reinforcement Learning via Estimated Uncertainty and Conservative Policy Optimization

Model-based Policy Optimization using Symbolic World Model

Model-based Multi-agent Policy Optimization with Adaptive Opponent-wise Rollouts

Policy Optimization with Model-based Explorations

Model-Based Offline Weighted Policy Optimization (Student Abstract)

Model-Based Decentralized Policy Optimization

Scalable Model-based Policy Optimization for Decentralized Networked Systems

Conservative Bayesian Model-Based Value Expansion for Offline Policy Optimization

How to Fine-tune the Model: Unified Model Shift and Model Bias Policy Optimization

Online Policy Optimization for Robust MDP

Gradient Information Matters in Policy Optimization by Back-propagating through Model

Optimistic Model Rollouts for Pessimistic Offline Policy Optimization

MOPO: Model-based Offline Policy Optimization

Model-based Deep Reinforcement Learning for Dynamic Portfolio Optimization

When to Update Your Model: Constrained Model-based Reinforcement Learning

Adversarial Constrained Policy Optimization: Improving Constrained Reinforcement Learning by Adapting Budgets