Abstract:By properly utilizing the learned environment model, model-based reinforcement learning methods can improve the sample efficiency for decision-making problems. Beyond using the learned environment model to train a policy, the success of MCTS-based methods shows that directly incorporating the learned environment model as a planner to make decisions might be more effective. However, when action space is of high dimension and continuous, directly planning according to the learned model is costly and non-trivial. Because of two challenges: (1) the infinite number of candidate actions and (2) the temporal dependency between actions in different timesteps. To address these challenges, inspired by Differential Dynamic Programming (DDP) in optimal control theory, we design a novel Policy Optimization with Model Planning (POMP) algorithm, which incorporates a carefully designed Deep Differential Dynamic Programming (D3P) planner into the model-based RL framework. In D3P planner, (1) to effectively plan in the continuous action space, we construct a locally quadratic programming problem that uses a gradient-based optimization process to replace search. (2) To take the temporal dependency of actions at different timesteps into account, we leverage the updated and latest actions of previous timesteps (i.e., step $1, \cdots, h-1$) to update the action of the current step (i.e., step $h$), instead of updating all actions simultaneously. We theoretically prove the convergence rate for our D3P planner and analyze the effect of the feedback term. In practice, to effectively apply the neural network based D3P planner in reinforcement learning, we leverage the policy network to initialize the action sequence and keep the action update conservative in the planning process. Experiments demonstrate that POMP consistently improves sample efficiency on widely used continuous control tasks. Our code is released at https://github.com/POMP-D3P/POMP-D3P.

When to Sense and Control? A Time-adaptive Approach for Continuous-Time RL

Time‐in‐action RL

Dynamic Decision Frequency with Continuous Options

Managing Temporal Resolution in Continuous Value Estimation: A Fundamental Trade-off

Sample-Efficient Reinforcement Learning with Temporal Logic Objectives: Leveraging the Task Specification to Guide Exploration

Temporal Difference Models: Model-Free Deep RL for Model-Based Control

Actor-Critic with variable time discretization via sustained actions

Efficient Exploration in Continuous-time Model-based Reinforcement Learning

Continuous-Time Reinforcement Learning: New Design Algorithms with Theoretical Insights and Performance Guarantees

Time-Constrained Robust MDPs

A Time-Aggregated Model-Free RL Algorithm for Optimal Containment Control of MASs

Robust Safe Reinforcement Learning Control of Unknown Continuous-Time Nonlinear Systems with State Constraints and Disturbances

Model-Based Safe Reinforcement Learning With Time-Varying Constraints: Applications to Intelligent Vehicles

Making Better Decision by Directly Planning in Continuous Control

Deployable Reinforcement Learning with Variable Control Rate

Steady-State Error Compensation in Reference Tracking and Disturbance Rejection Problems for Reinforcement Learning-Based Control

Learning Optimal Strategies for Temporal Tasks in Stochastic Games

CORL: A Continuous-state Offset-dynamics Reinforcement Learner

Reinforcement Learning for Temporal Logic Control Synthesis with Probabilistic Satisfaction Guarantees

Overcoming Slow Decision Frequencies in Continuous Control: Model-Based Sequence Reinforcement Learning for Model-Free Control

Signal Temporal Logic Neural Predictive Control