Abstract: By properly utilizing a learned environment model, model-based reinforcement learning methods can improve sample efficiency on decision-making problems. Beyond using the learned environment model to train a policy, the success of MCTS-based methods shows that directly incorporating the learned model as a planner to make decisions might be more effective. However, when the action space is high-dimensional and continuous, planning directly with the learned model is costly and non-trivial because of two challenges: (1) the infinite number of candidate actions and (2) the temporal dependency between actions at different timesteps. To address these challenges, inspired by Differential Dynamic Programming (DDP) in optimal control theory, we design a novel Policy Optimization with Model Planning (POMP) algorithm, which incorporates a carefully designed Deep Differential Dynamic Programming (D3P) planner into the model-based RL framework. In the D3P planner, (1) to plan effectively in the continuous action space, we construct a locally quadratic programming problem that uses a gradient-based optimization process to replace search; (2) to take the temporal dependency of actions at different timesteps into account, we leverage the latest updated actions of previous timesteps (i.e., steps $1, \cdots, h-1$) to update the action of the current step (i.e., step $h$), instead of updating all actions simultaneously. We theoretically prove the convergence rate of the D3P planner and analyze the effect of the feedback term. In practice, to effectively apply the neural-network-based D3P planner in reinforcement learning, we leverage the policy network to initialize the action sequence and keep the action update conservative during planning. Experiments demonstrate that POMP consistently improves sample efficiency on widely used continuous control tasks. Our code is released at https://github.com/POMP-D3P/POMP-D3P.
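
The sequential update with a feedback term described in the abstract can be illustrated with a textbook DDP/iLQR-style iteration. The sketch below is not the authors' D3P planner: it assumes a known toy linear model and a quadratic cost in place of a learned neural-network model and reward, and the names `A`, `B`, `Q`, `R`, `Qf`, `backward_pass`, `forward_pass`, and `step` are hypothetical, chosen only for illustration. It shows how a local quadratic model yields a feedforward term `k` and a feedback gain `K`, and how the action at step $h$ is updated using the states produced by the already-updated actions of earlier steps.

```python
import numpy as np

# Assumed toy problem (not from the paper): linear dynamics s' = A s + B a,
# running cost s^T Q s + a^T R a, terminal cost s^T Qf s.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = 0.1 * np.eye(1)
Qf = 10.0 * np.eye(2)

def rollout(s0, actions):
    """Simulate the model from s0 under a fixed action sequence."""
    states = [s0]
    for a in actions:
        states.append(A @ states[-1] + B @ a)
    return np.array(states)

def total_cost(states, actions):
    running = sum(s @ Q @ s + a @ R @ a for s, a in zip(states[:-1], actions))
    return running + states[-1] @ Qf @ states[-1]

def backward_pass(states, actions):
    """Build the local quadratic model backwards in time and return
    feedforward terms k_h and feedback gains K_h for every step."""
    V_x, V_xx = 2 * Qf @ states[-1], 2 * Qf
    ks, Ks = [], []
    for s, a in zip(states[-2::-1], actions[::-1]):
        Q_x = 2 * Q @ s + A.T @ V_x
        Q_u = 2 * R @ a + B.T @ V_x
        Q_xx = 2 * Q + A.T @ V_xx @ A
        Q_uu = 2 * R + B.T @ V_xx @ B
        Q_ux = B.T @ V_xx @ A
        k = -np.linalg.solve(Q_uu, Q_u)   # feedforward (gradient-based) term
        K = -np.linalg.solve(Q_uu, Q_ux)  # feedback gain
        V_x = Q_x + Q_ux.T @ k
        V_xx = Q_xx + Q_ux.T @ K
        ks.append(k)
        Ks.append(K)
    return ks[::-1], Ks[::-1]

def forward_pass(states, actions, ks, Ks, step=1.0):
    """Update actions one step at a time. The feedback term uses the NEW
    state at step h, which already reflects the updated earlier actions."""
    new_states, new_actions = [states[0]], []
    for h in range(len(actions)):
        a = actions[h] + step * ks[h] + Ks[h] @ (new_states[h] - states[h])
        new_actions.append(a)
        new_states.append(A @ new_states[h] + B @ a)
    return np.array(new_states), np.array(new_actions)

s0 = np.array([1.0, 0.0])
actions = np.zeros((20, 1))           # stand-in for a policy-initialized sequence
states = rollout(s0, actions)
cost_before = total_cost(states, actions)

ks, Ks = backward_pass(states, actions)
states, actions = forward_pass(states, actions, ks, Ks)
cost_after = total_cost(states, actions)
print(cost_before, cost_after)
```

In this sketch, choosing `step` below 1 shrinks the feedforward term and so makes each update more conservative. That is only one possible reading of the conservative update mentioned in the abstract, which does not specify the mechanism.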

Data-Efficient Reinforcement Learning in Continuous-State POMDPs

Data-Efficient Reinforcement Learning Using Active Exploration Method

An Active Exploration Method for Data Efficient Reinforcement Learning

Robust Reinforcement Learning in POMDPs with Incomplete and Noisy Observations

POMDPs in Continuous Time and Discrete Spaces

Dynamic Observation Policies in Observation Cost-Sensitive Reinforcement Learning

Online Reinforcement Learning for Real-Time Exploration in Continuous State and Action Markov Decision Processes

Improving PILCO with Bayesian Neural Network Dynamics Models

Learning Interpretable Policies in Hindsight-Observable POMDPs through Partially Supervised Reinforcement Learning

An efficient reinforcement learning algorithm for learning deterministic policies in continuous domains

Online algorithms for POMDPs with continuous state, action, and observation spaces

Making Better Decision by Directly Planning in Continuous Control

CORL: A Continuous-state Offset-dynamics Reinforcement Learner

Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes

Sample Efficient Reinforcement Learning In Continuous State Spaces: A Perspective Beyond Linearity

OCMDP: Observation-Constrained Markov Decision Process

Addressing Action Oscillations Through Learning Policy Inertia

Tracking as Online Decision-Making: Learning a Policy from Streaming Videos with Reinforcement Learning

Model-Based Reinforcement Learning In Continuous Environments Using Real-Time Constrained Optimization

Robust On-Policy Sampling for Data-Efficient Policy Evaluation in Reinforcement Learning

Deep reinforcement learning for the olfactory search POMDP: a quantitative benchmark