Abstract: This article presents a reinforcement learning framework for continuous-time dynamical systems without a priori discretization of time, state, and action. Based on the Hamilton-Jacobi-Bellman (HJB) equation for infinite-horizon, discounted reward problems, we derive algorithms for estimating value functions and improving policies with the use of function approximators. The process of value function estimation is formulated as the minimization of a continuous-time form of the temporal difference (TD) error. Update methods based on backward Euler approximation and exponential eligibility traces are derived, and their correspondences with the conventional residual gradient, TD(0), and TD(λ) algorithms are shown. For policy improvement, two methods are formulated: a continuous actor-critic method and a value-gradient-based greedy policy. As a special case of the latter, a nonlinear feedback control law using the value gradient and the model of the input gain is derived. Advantage updating, a model-free algorithm derived previously, is also formulated in the HJB-based framework. The performance of the proposed algorithms is first tested in a nonlinear control task of swinging a pendulum up with limited torque. It is shown in the simulations that (1) the task is accomplished by the continuous actor-critic method in several times fewer trials than by the conventional discrete actor-critic method; (2) among the continuous policy update methods, the value-gradient-based policy with a known or learned dynamic model performs several times better than the actor-critic method; and (3) a value function update using exponential eligibility traces is more efficient and stable than that based on Euler approximation. The algorithms are then tested in a higher-dimensional task: cart-pole swing-up. This task is accomplished in several hundred trials using the value-gradient-based policy with a learned dynamic model.
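To make the continuous-time TD error concrete, here is a minimal sketch. It assumes the discounted value function is V(t) = ∫ exp(-(s - t)/τ) r(s) ds taken from t to infinity, with time constant τ, so that the TD error takes the form δ(t) = r(t) - V(t)/τ + dV/dt. The abstract does not state this formula, and the function and parameter names below are illustrative only. Replacing dV/dt with a backward Euler difference over a step Δt gives the discretized error computed here.

```python
def td_error_backward_euler(r_t, v_t, v_prev, dt, tau):
    """Backward Euler approximation of the continuous-time TD error.

    Assumed form (not stated in the abstract):
        delta(t) = r(t) - V(t)/tau + dV/dt
    with dV/dt approximated by (V(t) - V(t - dt)) / dt, which gives
        delta(t) = r(t) + ((1 - dt/tau) * V(t) - V(t - dt)) / dt

    r_t    : reward at time t
    v_t    : value estimate V(t)
    v_prev : value estimate V(t - dt)
    dt     : time step of the Euler difference
    tau    : time constant of the exponential discount
    """
    return r_t + ((1.0 - dt / tau) * v_t - v_prev) / dt


# Sanity check under the assumed definition: a constant reward r has the
# exact value V = tau * r, so the TD error should vanish.
tau, dt, r = 2.0, 0.01, 1.0
delta_const = td_error_backward_euler(r, tau * r, tau * r, dt, tau)
```

Under these assumptions, multiplying δ(t) by Δt and writing γ = 1 - Δt/τ gives the familiar discrete-time shape r·Δt + γ·V(t) - V(t - Δt), which is one way to read the correspondence with TD(0) that the abstract mentions.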

Reinforcement Learning for Continuous-Time Optimal Execution: Actor-Critic Algorithm and Error Analysis

Learn Continuously, Act Discretely: Hybrid Action-Space Reinforcement Learning For Optimal Execution

Sublinear Regret for a Class of Continuous-Time Linear-Quadratic Reinforcement Learning Problems

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

Reinforcement Learning for Learning Rate Control

Reinforcement Learning for Continuous-Time Mean-Variance Portfolio Selection in a Regime-Switching Market

Reinforcement Learning in Continuous Time and Space

Actor-Critic Reinforcement Learning with Phased Actor

Continuous‐time mean–variance portfolio selection: A reinforcement learning framework

A Single-Loop Deep Actor-Critic Algorithm for Constrained Reinforcement Learning with Provable Convergence

Efficient Continuous Control with Double Actors and Regularized Critics

Optimal Scheduling of Entropy Regularizer for Continuous-Time Linear-Quadratic Reinforcement Learning

Deep Reinforcement Learning for Online Optimal Execution Strategies

Two Kinds of Learning Algorithms for Continuous-Time VWAP Targeting Execution

Broad Critic Deep Actor Reinforcement Learning for Continuous Control

Reinforcement Learning for a Discrete-Time Linear-Quadratic Control Problem with an Application

Reinforcement Learning in Non-Markov Market-Making

Model-Based Safe Reinforcement Learning With Time-Varying Constraints: Applications to Intelligent Vehicles

Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for Addressing Value Estimation Errors

Stochastic Latent Actor-Critic: Deep Reinforcement Learning with a Latent Variable Model