Abstract: By properly utilizing a learned environment model, model-based reinforcement learning methods can improve the sample efficiency of decision-making problems. Beyond using the learned environment model to train a policy, the success of MCTS-based methods shows that directly incorporating the learned model as a planner to make decisions might be more effective. However, when the action space is high-dimensional and continuous, planning directly with the learned model is costly and non-trivial because of two challenges: (1) the infinite number of candidate actions and (2) the temporal dependency between actions at different timesteps. To address these challenges, inspired by Differential Dynamic Programming (DDP) in optimal control theory, we design a novel Policy Optimization with Model Planning (POMP) algorithm, which incorporates a carefully designed Deep Differential Dynamic Programming (D3P) planner into the model-based RL framework. In the D3P planner, (1) to plan effectively in the continuous action space, we construct a locally quadratic programming problem that replaces search with a gradient-based optimization process. (2) To take the temporal dependency between actions at different timesteps into account, we use the latest updated actions of the previous timesteps (i.e., steps $1, \cdots, h-1$) to update the action of the current step (i.e., step $h$), instead of updating all actions simultaneously. We theoretically prove the convergence rate of our D3P planner and analyze the effect of the feedback term. In practice, to effectively apply the neural-network-based D3P planner in reinforcement learning, we use the policy network to initialize the action sequence and keep the action updates conservative during planning. Experiments demonstrate that POMP consistently improves sample efficiency on widely used continuous control tasks. Our code is released at https://github.com/POMP-D3P/POMP-D3P.
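
The sequential update described in the abstract, where the action at step $h$ is refined using the already-updated actions of earlier steps, can be illustrated on a toy problem. The sketch below is not the D3P planner: it assumes a hypothetical known linear model and quadratic reward in place of a learned neural model, uses finite-difference gradients instead of the locally quadratic approximation, and omits the feedback term. An all-zero action sequence stands in for the policy network's proposal. All names (`rollout_return`, `plan_sequential`, `A`, `B`) are illustrative only.

```python
import numpy as np

def rollout_return(s0, actions, A, B):
    """Return of an action sequence under the assumed toy model
    s' = A s + B a with reward r(s, a) = -(|s|^2 + 0.1 |a|^2)."""
    s, total = s0, 0.0
    for a in actions:
        total += -(s @ s + 0.1 * (a @ a))
        s = A @ s + B @ a
    return total - s @ s  # terminal penalty on the final state

def grad_wrt_action(s0, actions, h, A, B, eps=1e-5):
    """Central finite-difference gradient of the return w.r.t. action h."""
    g = np.zeros_like(actions[h])
    for i in range(g.size):
        plus, minus = actions.copy(), actions.copy()
        plus[h, i] += eps
        minus[h, i] -= eps
        g[i] = (rollout_return(s0, plus, A, B)
                - rollout_return(s0, minus, A, B)) / (2 * eps)
    return g

def plan_sequential(s0, init_actions, A, B, step=0.5, sweeps=20):
    """Refine actions one timestep at a time. The gradient for step h is
    evaluated with the latest values of actions 0..h-1 from this sweep,
    rather than updating the whole sequence simultaneously."""
    actions = init_actions.copy()
    for _ in range(sweeps):
        for h in range(len(actions)):
            actions[h] += step * grad_wrt_action(s0, actions, h, A, B)
    return actions

# Toy double-integrator-like system (assumed for illustration).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
s0 = np.array([1.0, 0.0])
init = np.zeros((5, 1))  # placeholder for a policy-proposed action sequence
planned = plan_sequential(s0, init, A, B)
```

Comparing `rollout_return(s0, planned, A, B)` with `rollout_return(s0, init, A, B)` shows whether the refined sequence improves on the initial proposal under the assumed model.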

A Scalable Model-Free Recurrent Neural Network Framework for Solving POMDPs

ODE-based Recurrent Model-free Reinforcement Learning for POMDPs

Verifiable RNN-Based Policies for POMDPs Under Temporal Logic Constraints

Recurrent Natural Policy Gradient for POMDPs

A fast algorithm for solving large scale nonlinear optimization problems using RNN

Recursively-Constrained Partially Observable Markov Decision Processes

Partially Observable Planning and Learning for Systems with Non-Uniform Dynamics

Deep Recurrent Policy Networks for Planning under Partial Observability

Counterexample-Guided Strategy Improvement for POMDPs Using Recurrent Neural Networks

Real-Time Recurrent Reinforcement Learning

GRSN: Gated Recurrent Spiking Neurons for POMDPs and MARL

Scaling Long-Horizon Online POMDP Planning via Rapid State Space Sampling

Making Better Decision by Directly Planning in Continuous Control

Recurrent Model Predictive Control: Learning an Explicit Recurrent Controller for Nonlinear Systems

Analytical Solution to A Discrete-Time Model for Dynamic Learning and Decision-Making

Scalable Model-based Policy Optimization for Decentralized Networked Systems

SVQN: Sequential Variational Soft Q-Learning Networks

On-Robot Bayesian Reinforcement Learning for POMDPs

On Improving Deep Reinforcement Learning for POMDPs

OMPO: A Unified Framework for RL under Policy and Dynamics Shifts

End-to-End Policy Gradient Method for POMDPs and Explainable Agents