Abstract: By properly utilizing a learned environment model, model-based reinforcement learning methods can improve sample efficiency on decision-making problems. Beyond using the learned environment model to train a policy, the success of MCTS-based methods shows that directly incorporating the learned environment model as a planner to make decisions might be more effective. However, when the action space is high-dimensional and continuous, planning directly with the learned model is costly and non-trivial because of two challenges: (1) the infinite number of candidate actions and (2) the temporal dependency between actions at different timesteps. To address these challenges, inspired by Differential Dynamic Programming (DDP) in optimal control theory, we design a novel Policy Optimization with Model Planning (POMP) algorithm, which incorporates a carefully designed Deep Differential Dynamic Programming (D3P) planner into the model-based RL framework. In the D3P planner, (1) to plan effectively in the continuous action space, we construct a locally quadratic programming problem that uses a gradient-based optimization process to replace search. (2) To take the temporal dependency of actions at different timesteps into account, we use the latest updated actions of the previous timesteps (i.e., steps $1, \cdots, h-1$) to update the action of the current step (i.e., step $h$), instead of updating all actions simultaneously. We theoretically prove the convergence rate of our D3P planner and analyze the effect of the feedback term. In practice, to apply the neural-network-based D3P planner effectively in reinforcement learning, we use the policy network to initialize the action sequence and keep the action updates conservative during planning. Experiments demonstrate that POMP consistently improves sample efficiency on widely used continuous control tasks. Our code is released at https://github.com/POMP-D3P/POMP-D3P.
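
The sequential update described in the abstract, where the action at step $h$ is refined using the already-updated actions of steps $1, \cdots, h-1$, can be illustrated with a minimal sketch. This is not the authors' D3P planner: it omits the locally quadratic approximation and the feedback term, and it replaces the learned neural model with a toy linear model and finite-difference gradients. All names here (`model_step`, `rollout_return`, `grad_wrt_action`, `plan_sequential`, the matrices `A` and `B`, and `step_size`) are assumptions made for illustration only.

```python
import numpy as np

# Toy stand-ins (assumptions, not from the paper): linear dynamics and a
# quadratic reward take the place of the learned neural environment model.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])


def model_step(state, action):
    """Hypothetical model: return the next state and the reward."""
    next_state = A @ state + B @ action
    reward = -float(next_state @ next_state) - 0.01 * float(action @ action)
    return next_state, reward


def rollout_return(state, actions):
    """Sum of model rewards when executing `actions` starting from `state`."""
    total = 0.0
    for action in actions:
        state, reward = model_step(state, action)
        total += reward
    return total


def grad_wrt_action(state, actions, h, eps=1e-4):
    """Central-difference gradient of the return with respect to actions[h]."""
    grad = np.zeros_like(actions[h])
    for i in range(grad.size):
        plus, minus = actions.copy(), actions.copy()
        plus[h, i] += eps
        minus[h, i] -= eps
        grad[i] = (rollout_return(state, plus)
                   - rollout_return(state, minus)) / (2 * eps)
    return grad


def plan_sequential(state, init_actions, n_iters=50, step_size=0.5):
    """Refine the action sequence one timestep at a time.

    When step h is updated, actions[:h] already hold the values updated
    earlier in the same sweep, so later steps see the effect of earlier
    changes. A small step_size keeps each update conservative.
    """
    actions = init_actions.copy()
    for _ in range(n_iters):
        for h in range(len(actions)):
            actions[h] += step_size * grad_wrt_action(state, actions, h)
    return actions
```

For example, calling `plan_sequential(np.array([1.0, 0.0]), np.zeros((5, 1)))` starts from an all-zero action sequence, standing in for the policy-network initialization mentioned in the abstract, and returns a refined sequence of the same shape. A simultaneous scheme would instead compute all gradients from the old sequence before changing any action.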

Model-based motion planning in POMDPs with temporal logic specifications

Model-free Motion Planning of Autonomous Agents for Complex Tasks in Partially Observable Environments

Model-free reinforcement learning for motion planning of autonomous agents with complex tasks in partially observable environments

Optimal Probabilistic Motion Planning with Potential Infeasible LTL Constraints

Modular Deep Reinforcement Learning for Continuous Motion Planning With Temporal Logic

Physics-based Motion Planning with Temporal Logic Specifications

PODDP: Partially Observable Differential Dynamic Programming for Latent Belief Space Planning

Reinforcement Learning with Temporal Logic Constraints for Partially-Observable Markov Decision Processes

Trust-Aware Motion Planning for Human-Robot Collaboration under Distribution Temporal Logic Specifications

Reinforcement Learning Based Temporal Logic Control with Maximum Probabilistic Satisfaction

A Unified Approach to Multi-task Legged Navigation: Temporal Logic Meets Reinforcement Learning

Hierarchical Motion Planning Under Probabilistic Temporal Tasks and Safe-Return Constraints

Synthesis of output feedback control for motion planning based on LTL specifications

Modeling and Control Architecture for the Competitive Networked Robot System Based on POMDP

Making Better Decision by Directly Planning in Continuous Control

Scaling Long-Horizon Online POMDP Planning via Rapid State Space Sampling

Point-Based Methods for Model Checking in Partially Observable Markov Decision Processes

Vision-Based Reactive Temporal Logic Motion Planning for Quadruped Robots in Unstructured Dynamic Environments

Control Theory Meets POMDPs: A Hybrid Systems Approach

Reasoning and Predicting POMDP Planning Complexity Via Covering Numbers

Multi-Agent Motion Planning From Signal Temporal Logic Specifications