Abstract: By properly utilizing the learned environment model, model-based reinforcement learning methods can improve the sample efficiency for decision-making problems. Beyond using the learned environment model to train a policy, the success of MCTS-based methods shows that directly incorporating the learned environment model as a planner to make decisions might be more effective. However, when the action space is high-dimensional and continuous, planning directly with the learned model is costly and non-trivial because of two challenges: (1) the infinite number of candidate actions and (2) the temporal dependency between actions at different timesteps. To address these challenges, inspired by Differential Dynamic Programming (DDP) in optimal control theory, we design a novel Policy Optimization with Model Planning (POMP) algorithm, which incorporates a carefully designed Deep Differential Dynamic Programming (D3P) planner into the model-based RL framework. In the D3P planner, (1) to plan effectively in the continuous action space, we construct a locally quadratic programming problem that uses a gradient-based optimization process to replace search. (2) To take the temporal dependency of actions at different timesteps into account, we leverage the latest updated actions of previous timesteps (i.e., steps $1, \cdots, h-1$) to update the action of the current step (i.e., step $h$), instead of updating all actions simultaneously. We theoretically prove the convergence rate of our D3P planner and analyze the effect of the feedback term. In practice, to effectively apply the neural-network-based D3P planner in reinforcement learning, we leverage the policy network to initialize the action sequence and keep the action update conservative in the planning process. Experiments demonstrate that POMP consistently improves sample efficiency on widely used continuous control tasks. Our code is released at https://github.com/POMP-D3P/POMP-D3P.
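
To make the sequential update in point (2) concrete, the following is a minimal sketch, not the D3P planner itself. It assumes a toy linear model and quadratic reward in place of the learned neural networks, uses a plain first-order gradient step rather than the locally quadratic problem, and omits the feedback term. All names (`rollout_return`, `grad_wrt_action`, `plan_sequential`) and the matrices `A`, `B`, `Q`, `R` are hypothetical. The initial `actions` array stands in for the sequence proposed by the policy network, and the small `step` stands in for the conservative update.

```python
import numpy as np

def rollout_return(s0, actions, A, B, Q, R):
    """Total reward of an action sequence under a toy linear model (assumed)."""
    s, total = s0, 0.0
    for a in actions:
        total -= s @ Q @ s + a @ R @ a  # quadratic cost used as negative reward
        s = A @ s + B @ a               # stand-in for the learned dynamics model
    return total - s @ Q @ s            # terminal state cost

def grad_wrt_action(s0, actions, h, A, B, Q, R, eps=1e-5):
    """Central finite-difference gradient of the return w.r.t. the action at step h."""
    g = np.zeros_like(actions[h])
    for i in range(g.size):
        plus, minus = actions.copy(), actions.copy()
        plus[h, i] += eps
        minus[h, i] -= eps
        g[i] = (rollout_return(s0, plus, A, B, Q, R)
                - rollout_return(s0, minus, A, B, Q, R)) / (2 * eps)
    return g

def plan_sequential(s0, actions, A, B, Q, R, step=0.5, iters=20):
    """Update actions one timestep at a time, in temporal order.

    The gradient for step h is evaluated on a sequence whose steps 0..h-1
    have already been updated in this sweep, instead of updating all
    actions simultaneously from the same stale sequence.
    """
    actions = actions.copy()  # leave the caller's initial sequence untouched
    for _ in range(iters):
        for h in range(len(actions)):
            actions[h] += step * grad_wrt_action(s0, actions, h, A, B, Q, R)
    return actions
```

A simultaneous variant would compute all gradients from the same sequence before applying any of them; the sweep above differs only in that each step sees the freshest earlier actions.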

OSSP-PTA: an Online Stochastic Stepping Policy for PTA on Reinforcement Learning

Adaptive Stepping PTA for DC Analysis Based on Reinforcement Learning

Successive Convex Approximation Based Off-Policy Optimization for Constrained Reinforcement Learning

ISPT-Net: A Noval Transient Backward-Stepping Reduction Policy by Irregular Sequential Prediction Transformer

BoA-PTA: A Bayesian Optimization Accelerated PTA Solver for SPICE Simulation

Stochastic Cubic-Regularized Policy Gradient Method

The Ladder in Chaos: A Simple and Effective Improvement to General DRL Algorithms by Policy Path Trimming and Boosting

Making Better Decision by Directly Planning in Continuous Control

Toward Expedited Impedance Tuning of a Robotic Prosthesis for Personalized Gait Assistance by Reinforcement Learning Control

AdaPT: Zero-Shot Adaptive Policy Transfer for Stochastic Dynamical Systems

Off-Policy Deep Reinforcement Learning Based on Steffensen Value Iteration

Offline Learning of Closed-Loop Deep Brain Stimulation Controllers for Parkinson Disease Treatment

Towards Expedited Impedance Tuning of a Robotic Prosthesis for Personalized Gait Assistance by Reinforcement Learning Control

Time-Efficient Reinforcement Learning with Stochastic Stateful Policies

A Deep Reinforcement Learning Approach for Online Parcel Assignment

RL-Driven MPPI: Accelerating Online Control Laws Calculation with Offline Policy

Get a Head Start: On-Demand Pedagogical Policy Selection in Intelligent Tutoring

Mildly Constrained Evaluation Policy for Offline Reinforcement Learning

Advanced-Step Real-time Iterations with Four Levels -- New Error Bounds and Fast Implementation in acados

Efficient and Stable Offline-to-online Reinforcement Learning Via Continual Policy Revitalization

Deep Reinforcement Learning-Based Tie-Line Power Adjustment Method for Power System Operation State Calculation