Abstract: By properly utilizing a learned environment model, model-based reinforcement learning methods can improve sample efficiency in decision-making problems. Beyond using the learned environment model to train a policy, the success of MCTS-based methods shows that directly incorporating the learned environment model as a planner to make decisions might be more effective. However, when the action space is high-dimensional and continuous, planning directly with the learned model is costly and non-trivial because of two challenges: (1) the infinite number of candidate actions and (2) the temporal dependency between actions at different timesteps. To address these challenges, inspired by Differential Dynamic Programming (DDP) in optimal control theory, we design a novel Policy Optimization with Model Planning (POMP) algorithm, which incorporates a carefully designed Deep Differential Dynamic Programming (D3P) planner into the model-based RL framework. In the D3P planner, (1) to plan effectively in the continuous action space, we construct a locally quadratic programming problem that uses a gradient-based optimization process to replace search; (2) to take the temporal dependency between actions at different timesteps into account, we use the latest updated actions of the previous timesteps (i.e., steps $1, \cdots, h-1$) to update the action of the current step (i.e., step $h$), instead of updating all actions simultaneously. We theoretically prove the convergence rate of the D3P planner and analyze the effect of the feedback term. In practice, to apply the neural-network-based D3P planner effectively in reinforcement learning, we use the policy network to initialize the action sequence and keep the action updates conservative during planning. Experiments demonstrate that POMP consistently improves sample efficiency on widely used continuous control tasks. Our code is released at https://github.com/POMP-D3P/POMP-D3P.
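The sequential update described in the abstract, where the action at step $h$ is updated using the already-updated actions of the earlier steps, can be illustrated with a small sketch. This is not the D3P planner itself: it replaces the locally quadratic problem and the feedback term with a plain first-order step, and it estimates gradients by finite differences instead of differentiating a neural network model. The names `rollout_return`, `sequential_plan`, `toy_dynamics`, and `toy_reward`, the scalar state and action, and all numeric settings are assumptions made for illustration only.

```python
def rollout_return(s0, actions, dynamics, reward):
    """Sum of rewards obtained by rolling an action sequence through a model."""
    s, total = s0, 0.0
    for a in actions:
        total += reward(s, a)
        s = dynamics(s, a)
    return total


def sequential_plan(s0, actions, dynamics, reward,
                    step_size=0.1, sweeps=50, eps=1e-4):
    """Update actions one timestep at a time, in temporal order.

    When the action at step h is updated, the actions at steps 0..h-1
    already hold their latest values, so the gradient for step h is
    evaluated on the states those updated actions lead to.
    """
    actions = list(actions)
    for _ in range(sweeps):
        for h in range(len(actions)):
            # Central finite difference of the return w.r.t. action h
            # (a stand-in for backpropagating through a learned model).
            up = actions[:h] + [actions[h] + eps] + actions[h + 1:]
            down = actions[:h] + [actions[h] - eps] + actions[h + 1:]
            grad = (rollout_return(s0, up, dynamics, reward)
                    - rollout_return(s0, down, dynamics, reward)) / (2 * eps)
            # Small step size keeps the update conservative.
            actions[h] += step_size * grad
    return actions


# Hypothetical toy model: the action shifts a scalar state, and the
# reward penalizes distance from zero plus a small action cost.
def toy_dynamics(s, a):
    return s + a


def toy_reward(s, a):
    return -(s * s) - 0.1 * a * a


initial_actions = [0.0] * 5  # in POMP this would come from the policy network
planned_actions = sequential_plan(1.0, initial_actions, toy_dynamics, toy_reward)
initial_return = rollout_return(1.0, initial_actions, toy_dynamics, toy_reward)
planned_return = rollout_return(1.0, planned_actions, toy_dynamics, toy_reward)
```

On this toy problem the planned sequence achieves a higher model return than the all-zero initialization. A simultaneous update would instead compute every gradient from the old action sequence before changing any action.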

Expert-guided Policy Optimization by Latent Space Planning with Attention

Deterministic Policy Optimization by Combining Pathwise and Score Function Estimators for Discrete Action Spaces

PcLast: Discovering Plannable Continuous Latent States

Policy Optimization with Smooth Guidance Learned from State-Only Demonstrations

Semantic Predictive Control For Explainable And Efficient Policy Learning

Improved Exploration through Latent Trajectory Optimization in Deep Deterministic Policy Gradient

Imagined Value Gradients: Model-Based Policy Optimization with Transferable Latent Dynamics Models

Efficient Planning with Latent Diffusion

Trajectory-Oriented Policy Optimization with Sparse Rewards

Making Better Decision by Directly Planning in Continuous Control

Jointly Learning Environments and Control Policies with Projected Stochastic Gradient Ascent

Combinatorial Optimization with Policy Adaptation using Latent Space Search

Low-Switching Policy Gradient with Exploration via Online Sensitivity Sampling

Hierarchical Policies for Cluttered-Scene Grasping with Latent Plans

Mixed Reinforcement Learning for Efficient Policy Optimization in Stochastic Environments

Soft Policy Optimization Using Dual-Track Advantage Estimator

Scheduled Policy Optimization for Natural Language Communication with Intelligent Agents

Careful at Estimation and Bold at Exploration

dGrasp: NeRF-Informed Implicit Grasp Policies with Supervised Optimization Slopes

Adaptive-Gradient Policy Optimization: Enhancing Policy Learning in Non-Smooth Differentiable Simulations

Dual Models to Facilitate Learning of Policy Network