Abstract: This article presents a reinforcement learning framework for continuous-time dynamical systems without a priori discretization of time, state, and action. Based on the Hamilton-Jacobi-Bellman (HJB) equation for infinite-horizon, discounted reward problems, we derive algorithms for estimating value functions and improving policies with the use of function approximators. The process of value function estimation is formulated as the minimization of a continuous-time form of the temporal difference (TD) error. Update methods based on backward Euler approximation and exponential eligibility traces are derived, and their correspondences with the conventional residual gradient, TD(0), and TD(lambda) algorithms are shown. For policy improvement, two methods are formulated: a continuous actor-critic method and a value-gradient-based greedy policy. As a special case of the latter, a nonlinear feedback control law using the value gradient and the model of the input gain is derived. Advantage updating, a model-free algorithm derived previously, is also formulated in the HJB-based framework. The performance of the proposed algorithms is first tested in a nonlinear control task of swinging a pendulum up with limited torque. It is shown in the simulations that (1) the task is accomplished by the continuous actor-critic method in several times fewer trials than by the conventional discrete actor-critic method; (2) among the continuous policy update methods, the value-gradient-based policy with a known or learned dynamic model performs several times better than the actor-critic method; and (3) a value function update using exponential eligibility traces is more efficient and stable than that based on Euler approximation. The algorithms are then tested in a higher-dimensional task: cart-pole swing-up. This task is accomplished in several hundred trials using the value-gradient-based policy with a learned dynamic model.
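To make the continuous-time TD error mentioned in the abstract concrete, the following minimal sketch may help. The abstract itself gives no formulas, so everything here is an assumption: the TD error is taken to be delta(t) = r(t) - V(t)/tau + dV/dt with discount time constant tau, the time derivative is replaced by a backward Euler difference, and the toy problem (one state, constant reward, scalar value estimate) along with the names `td_error`, `learn_constant_reward_value`, `eta`, and `dt` are illustrative choices rather than the article's own setup.

```python
def td_error(r, v_now, v_prev, dt, tau):
    """Backward-Euler approximation of the assumed continuous-time TD error
    delta(t) = r(t) - V(t)/tau + dV/dt, using dV/dt ~ (V(t) - V(t - dt)) / dt.
    """
    return r + ((1.0 - dt / tau) * v_now - v_prev) / dt


def learn_constant_reward_value(r=1.0, tau=1.0, dt=0.01, eta=1.0, steps=2000):
    """Toy case (hypothetical): a single state with constant reward r.

    The value estimate is one scalar w, so V(t) = V(t - dt) = w and the
    gradient of V with respect to w is 1. The TD(0)-style update is then
    w <- w + eta * delta * dt.
    """
    w = 0.0
    for _ in range(steps):
        delta = td_error(r, w, w, dt, tau)  # reduces to r - w / tau here
        w += eta * delta * dt
    return w


if __name__ == "__main__":
    # For a constant reward, the discounted value is tau * r, so the
    # estimate should approach that number.
    print(learn_constant_reward_value(r=1.0, tau=1.0))
```

In this toy setting the TD error vanishes exactly when the estimate equals tau * r, which is the discounted integral of a constant reward, so the loop is only a sanity check on the error definition and not a reproduction of the pendulum or cart-pole experiments.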

PhiBE: A PDE-based Bellman Equation for Continuous Time Policy Evaluation

On Bellman equations for continuous-time policy evaluation I: discretization and approximation

Efficient Exploration in Continuous-time Model-based Reinforcement Learning

Reinforcement learning in continuous time and space

A Generalized Projected Bellman Error for Off-policy Value Estimation in Reinforcement Learning

Policy Optimization for Continuous Reinforcement Learning

An efficient reinforcement learning algorithm for learning deterministic policies in continuous domains

Balancing Value Iteration and Policy Iteration for Discrete-Time Control

Integral Reinforcement Learning for Linear Continuous-Time Zero-Sum Games With Completely Unknown Dynamics

Live in the Moment: Learning Dynamics Model Adapted to Evolving Policy

Reinforcement Learning Policies in Continuous-Time Linear Systems

The Uncertainty Bellman Equation and Exploration

Bellman Meets Hawkes: Model-Based Reinforcement Learning via Temporal Point Processes

Stable and Efficient Policy Evaluation

Byzantine-Resilient Decentralized Policy Evaluation with Linear Function Approximation

Wide-Sense Stationary Policy Optimization with Bellman Residual on Video Games

Revised Progressive-Hedging-Algorithm Based Two-layer Solution Scheme for Bayesian Reinforcement Learning

Diminishing Return of Value Expansion Methods in Model-Based Reinforcement Learning

A Priori Estimates for Deep Residual Network in Continuous-time Reinforcement Learning

Planning with Exploration: Addressing Dynamics Bottleneck in Model-based Reinforcement Learning