Abstract: By properly utilizing the learned environment model, model-based reinforcement learning methods can improve the sample efficiency of decision-making problems. Beyond using the learned environment model to train a policy, the success of MCTS-based methods shows that directly incorporating the learned environment model as a planner to make decisions might be more effective. However, when the action space is high-dimensional and continuous, planning directly with the learned model is costly and non-trivial because of two challenges: (1) the infinite number of candidate actions and (2) the temporal dependency between actions at different timesteps. To address these challenges, inspired by Differential Dynamic Programming (DDP) in optimal control theory, we design a novel Policy Optimization with Model Planning (POMP) algorithm, which incorporates a carefully designed Deep Differential Dynamic Programming (D3P) planner into the model-based RL framework. In the D3P planner, (1) to plan effectively in the continuous action space, we construct a locally quadratic programming problem that uses a gradient-based optimization process to replace search; (2) to take the temporal dependency of actions at different timesteps into account, we use the latest updated actions of previous timesteps (i.e., steps $1, \cdots, h-1$) to update the action of the current step (i.e., step $h$), instead of updating all actions simultaneously. We theoretically prove the convergence rate of our D3P planner and analyze the effect of the feedback term. In practice, to effectively apply the neural-network-based D3P planner in reinforcement learning, we use the policy network to initialize the action sequence and keep the action update conservative in the planning process. Experiments demonstrate that POMP consistently improves sample efficiency on widely used continuous control tasks. Our code is released at https://github.com/POMP-D3P/POMP-D3P.
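
The sequential update described in the abstract, where the action at step $h$ is refined using the already-updated actions of steps $1, \cdots, h-1$, can be illustrated with a minimal sketch. This is not the authors' D3P planner: it assumes a hypothetical scalar linear model `x' = a*x + b*u` with a quadratic reward in place of a learned neural model, uses plain finite-difference gradient steps without the feedback term, and starts from an all-zero action sequence where POMP would use the policy network. All function names and parameter values are illustrative assumptions.

```python
def rollout_return(x0, actions, a=1.0, b=0.5, r=0.1):
    """Total reward of an action sequence under the toy model x' = a*x + b*u.

    The per-step reward is -(x^2 + r*u^2); both the model and the reward
    are assumptions made for this sketch, not taken from the paper.
    """
    x, total = x0, 0.0
    for u in actions:
        total -= x * x + r * u * u
        x = a * x + b * u
    return total - x * x  # cost of the final state


def grad_wrt_action(x0, actions, h, eps=1e-5):
    """Central-difference gradient of the return with respect to actions[h]."""
    plus = list(actions)
    plus[h] += eps
    minus = list(actions)
    minus[h] -= eps
    return (rollout_return(x0, plus) - rollout_return(x0, minus)) / (2 * eps)


def plan_sequential(x0, init_actions, step_size=0.1, sweeps=20):
    """Refine the action sequence one timestep at a time.

    When step h is updated, the rollout already contains the updated
    actions of steps 0..h-1, rather than updating all actions at once.
    The small step size keeps each update conservative.
    """
    actions = list(init_actions)
    for _ in range(sweeps):
        for h in range(len(actions)):
            actions[h] += step_size * grad_wrt_action(x0, actions, h)
    return actions


if __name__ == "__main__":
    init = [0.0] * 5  # stand-in for a policy-network initialization
    planned = plan_sequential(1.0, init)
    print("return before planning:", rollout_return(1.0, init))
    print("return after planning: ", rollout_return(1.0, planned))
```

Running the script prints the return of the initial and the planned action sequence. On this toy problem each per-step update can only increase the return, because the step size is small relative to the curvature of the quadratic objective; no such guarantee is implied for a learned model.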

Model-based Reinforcement Learning for Semi-Markov Decision Processes with Neural ODEs

ODE-based Recurrent Model-free Reinforcement Learning for POMDPs

Efficient Exploration in Continuous-time Model-based Reinforcement Learning

Model-based Reinforcement Learning with a Hamiltonian Canonical ODE Network

Model-Based Reinforcement Learning via Stochastic Hybrid Models

Model-based Meta Reinforcement Learning using Graph Structured Surrogate Models and Amortized Policy Search

Model-based Safe Deep Reinforcement Learning via a Constrained Proximal Policy Optimization Algorithm

Tackling Decision Processes with Non-Cumulative Objectives using Reinforcement Learning

Towards Solving Industrial Sequential Decision-making Tasks under Near-predictable Dynamics via Reinforcement Learning: an Implicit Corrective Value Estimation Approach

Model-based Deep Reinforcement Learning for Dynamic Portfolio Optimization

Model-Based Reinforcement Learning via Meta-Policy Optimization

Bellman Meets Hawkes: Model-Based Reinforcement Learning via Temporal Point Processes

Making Better Decision by Directly Planning in Continuous Control

Semi-Infinitely Constrained Markov Decision Processes and Provably Efficient Reinforcement Learning

Safe Model-Based Reinforcement Learning for Systems with Parametric Uncertainties

Deep Online Learning via Meta-Learning: Continual Adaptation for Model-Based RL

Model-Based Reinforcement Learning Control of Reaction-Diffusion Problems

A Deep Reinforcement Learning Approach to Asset-Liability Management

Model-Free Reinforcement Learning for Stochastic Games with Linear Temporal Logic Objectives

Mixed Reinforcement Learning for Efficient Policy Optimization in Stochastic Environments

Reinforcement learning based MPC with neural dynamical models