Abstract: By properly utilizing the learned environment model, model-based reinforcement learning methods can improve the sample efficiency for decision-making problems. Beyond using the learned environment model to train a policy, the success of MCTS-based methods shows that directly incorporating the learned environment model as a planner to make decisions might be more effective. However, when the action space is high-dimensional and continuous, planning directly with the learned model is costly and non-trivial because of two challenges: (1) the infinite number of candidate actions and (2) the temporal dependency between actions at different timesteps. To address these challenges, inspired by Differential Dynamic Programming (DDP) in optimal control theory, we design a novel Policy Optimization with Model Planning (POMP) algorithm, which incorporates a carefully designed Deep Differential Dynamic Programming (D3P) planner into the model-based RL framework. In the D3P planner, (1) to plan effectively in the continuous action space, we construct a locally quadratic programming problem that uses a gradient-based optimization process to replace search. (2) To take the temporal dependency of actions at different timesteps into account, we leverage the updated and latest actions of previous timesteps (i.e., steps $1, \cdots, h-1$) to update the action of the current step (i.e., step $h$), instead of updating all actions simultaneously. We theoretically prove the convergence rate of our D3P planner and analyze the effect of the feedback term. In practice, to effectively apply the neural-network-based D3P planner in reinforcement learning, we leverage the policy network to initialize the action sequence and keep the action update conservative in the planning process. Experiments demonstrate that POMP consistently improves sample efficiency on widely used continuous control tasks. Our code is released at https://github.com/POMP-D3P/POMP-D3P.
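The sequential update described in the abstract, where the action at step $h$ is refined using the already-updated actions of the earlier steps, can be illustrated with a small sketch. This is not the D3P planner from the paper: the one-dimensional model `s' = s + a`, the quadratic reward, the finite-difference gradient, and all function names are assumptions chosen for illustration. The small `step_size` stands in for the conservative action update, and `init_actions` stands in for the action sequence that the paper initializes from the policy network.

```python
def rollout_return(s0, actions):
    """Total reward under an assumed toy model: s' = s + a, r = -s^2 - 0.1*a^2."""
    s, total = s0, 0.0
    for a in actions:
        total += -(s * s) - 0.1 * a * a
        s = s + a
    return total - s * s  # assumed terminal penalty on the final state


def grad_wrt_action(s0, actions, h, eps=1e-4):
    """Central finite-difference gradient of the return w.r.t. the action at step h."""
    up = list(actions)
    up[h] += eps
    down = list(actions)
    down[h] -= eps
    return (rollout_return(s0, up) - rollout_return(s0, down)) / (2 * eps)


def plan_sequential(s0, init_actions, step_size=0.05, sweeps=50):
    """Update actions one step at a time, in temporal order.

    When step h is updated, the gradient is evaluated on a sequence that
    already contains the new values for steps 0..h-1, rather than updating
    all actions simultaneously from the old sequence.
    """
    actions = list(init_actions)
    for _ in range(sweeps):
        for h in range(len(actions)):
            actions[h] += step_size * grad_wrt_action(s0, actions, h)
    return actions


if __name__ == "__main__":
    init = [0.0] * 5  # hypothetical initial action sequence
    planned = plan_sequential(1.0, init)
    print(rollout_return(1.0, init), rollout_return(1.0, planned))
```

In this toy setting the return is a concave quadratic in the actions, so each small per-step update cannot decrease it. The planner in the paper instead works with a learned neural network model and a locally quadratic approximation, which this sketch does not attempt to reproduce.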

Decision Making in Non-Stationary Environments with Policy-Augmented Monte Carlo Tree Search

Decision Making in Non-Stationary Environments with Policy-Augmented Search

Act as You Learn: Adaptive Decision-Making in Non-Stationary Markov Decision Processes

Policy Gradient Algorithms with Monte Carlo Tree Learning for Non-Markov Decision Processes

Enhancing Reinforcement Learning Through Guided Search

Optimized Monte Carlo Tree Search for Enhanced Decision Making in the FrozenLake Environment

C-MCTS: Safe Planning with Monte Carlo Tree Search

An Analysis on the Effects of Evolving the Monte Carlo Tree Search Upper Confidence for Trees Selection Policy on Unimodal, Multimodal and Deceptive Landscapes

Bayes Adaptive Monte Carlo Tree Search for Offline Model-based Reinforcement Learning

Non-Deterministic Policies in Markovian Decision Processes

Threshold UCT: Cost-Constrained Monte Carlo Tree Search with Pareto Curves

An Efficient Dynamic Sampling Policy for Monte Carlo Tree Search

Continuous Monte Carlo Graph Search

Monte Carlo tree search control scheme for multibody dynamics applications

Decision Mamba: A Multi-Grained State Space Model with Self-Evolution Regularization for Offline RL

Safe Reinforcement Learning for Autonomous Vehicle Using Monte Carlo Tree Search

Acting in Delayed Environments with Non-Stationary Markov Policies

Making Better Decision by Directly Planning in Continuous Control

Online model adaptation in Monte Carlo tree search planning

Maneuver Decision-Making Through Proximal Policy Optimization And Monte Carlo Tree Search