Abstract: This article presents a reinforcement learning framework for continuous-time dynamical systems without a priori discretization of time, state, and action. Based on the Hamilton-Jacobi-Bellman (HJB) equation for infinite-horizon, discounted reward problems, we derive algorithms for estimating value functions and improving policies with the use of function approximators. The process of value function estimation is formulated as the minimization of a continuous-time form of the temporal difference (TD) error. Update methods based on backward Euler approximation and exponential eligibility traces are derived, and their correspondences with the conventional residual gradient, TD(0), and TD(λ) algorithms are shown. For policy improvement, two methods (a continuous actor-critic method and a value-gradient-based greedy policy) are formulated. As a special case of the latter, a nonlinear feedback control law using the value gradient and the model of the input gain is derived. Advantage updating, a model-free algorithm derived previously, is also formulated in the HJB-based framework. The performance of the proposed algorithms is first tested in a nonlinear control task of swinging a pendulum up with limited torque. It is shown in the simulations that (1) the task is accomplished by the continuous actor-critic method in a number of trials several times fewer than by the conventional discrete actor-critic method; (2) among the continuous policy update methods, the value-gradient-based policy with a known or learned dynamic model performs several times better than the actor-critic method; and (3) a value function update using exponential eligibility traces is more efficient and stable than that based on Euler approximation. The algorithms are then tested in a higher-dimensional task: cart-pole swing-up. This task is accomplished in several hundred trials using the value-gradient-based policy with a learned dynamic model.
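As an illustration of the correspondence the abstract mentions between the backward Euler update and conventional TD(0), the following is a minimal sketch rather than the article's own code. It assumes the continuous-time TD error takes the form δ(t) = r(t) − V(t)/τ + dV/dt, where τ is a discount time constant; the abstract does not state this formula, and the function names, the step size `dt`, and the sample values are hypothetical.

```python
def td_error_backward_euler(r_t, v_t, v_prev, dt, tau):
    """Continuous-time TD error, with dV/dt replaced by a backward difference.

    Assumed form: delta(t) = r(t) - V(t)/tau + dV/dt,
    with dV/dt ~ (V(t) - V(t - dt)) / dt.
    """
    v_dot = (v_t - v_prev) / dt
    return r_t - v_t / tau + v_dot


def discrete_td_error(reward, v_next, v_curr, gamma):
    """Conventional discrete-time TD(0) error: r + gamma * V(next) - V(current)."""
    return reward + gamma * v_next - v_curr


# Hypothetical sample values, chosen only for illustration.
r_t, v_t, v_prev = 0.5, 2.0, 1.8
dt, tau = 0.02, 1.0

delta_continuous = td_error_backward_euler(r_t, v_t, v_prev, dt, tau)

# Under the assumed form, multiplying the continuous error by dt gives a
# discrete TD error with discount gamma = 1 - dt/tau and step reward dt * r.
gamma = 1.0 - dt / tau
delta_discrete = discrete_td_error(dt * r_t, v_t, v_prev, gamma)

print(delta_continuous, delta_discrete)
```

Under these assumptions the two errors differ only by the factor `dt`, which is one way to read the stated correspondence with TD(0): the discrete discount factor plays the role of 1 − Δt/τ.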

Kernel dynamic policy programming: Applicable reinforcement learning to robot systems with high dimensional states

Ensemble Bootstrapped Deep Deterministic Policy Gradient For Vision-Based Robotic Grasping

Deep Model-Based Reinforcement Learning for Predictive Control of Robotic Systems with Dense and Sparse Rewards

Pneumatic artificial muscle-driven robot control using local update reinforcement learning

Local Update Dynamic Policy Programming in reinforcement learning of pneumatic artificial muscle-driven humanoid hand control

Imagined Value Gradients: Model-Based Policy Optimization with Transferable Latent Dynamics Models

Kernel-Based Decentralized Policy Evaluation for Reinforcement Learning

Reinforcement Learning Tracking Control for Robotic Manipulator With Kernel-Based Dynamic Model

DOP: Deep Optimistic Planning with Approximate Value Function Evaluation

A kernel-based approximate dynamic programming approach: Theory and application

Reinforcement Learning in Continuous Time and Space

Improved Exploration through Latent Trajectory Optimization in Deep Deterministic Policy Gradient

A Modified Convergence DDPG Algorithm for Robotic Manipulation

Kalman meets Bellman: Improving Policy Evaluation through Value Tracking

Dynamic mean field programming

High-Dimensional Stochastic Optimal Control using Continuous Tensor Decompositions

A High-Efficient Reinforcement Learning Approach for Dexterous Manipulation

Robot Policy Improvement With Natural Evolution Strategies for Stable Nonlinear Dynamical System

Revisiting approximate dynamic programming and its convergence

Learning of Long-Horizon Sparse-Reward Robotic Manipulator Tasks With Base Controllers