Abstract: Policy gradient methods in actor-critic reinforcement learning (RL) have become perhaps the most promising approaches to solving continuous optimal control problems. However, the trial-and-error nature of RL and the inherent randomness of solution approximations cause variations in the learned optimal values and policies. This has significantly hindered their deployment in real-life applications where control responses must meet dynamic performance criteria deterministically. Here we propose a novel phased actor in actor-critic (PAAC) method that aims to improve policy gradient estimation and thus the quality of the control policy. Specifically, PAAC accounts for both the $Q$ value and the TD error in its actor update. We prove qualitative properties of PAAC for learning convergence of the value and policy, solution optimality, and stability of system dynamics. Additionally, we show variance reduction in policy gradient estimation. PAAC performance is systematically and quantitatively evaluated using the DeepMind Control Suite (DMC). Results show that PAAC leads to significant performance improvement as measured by total cost, learning variance, robustness, learning speed, and success rate. As PAAC can be piggybacked onto general policy gradient learning frameworks, we select well-known methods such as direct heuristic dynamic programming (dHDP), deep deterministic policy gradient (DDPG), and their variants to demonstrate the effectiveness of PAAC. Consequently, we provide a unified view of these related policy gradient algorithms.

Policy Iteration Approximate Dynamic Programming Using Volterra Series Based Actor

Policy Approximation in Policy Iteration Approximate Dynamic Programming for Discrete-Time Nonlinear Systems

Error bound analysis of policy iteration based approximate dynamic programming for deterministic discrete-time nonlinear systems

Revisiting approximate dynamic programming and its convergence

Model-free Adaptive Dynamic Programming for Optimal Control of Discrete-time Affine Nonlinear System

An Approximate Policy Iteration Viewpoint of Actor-Critic Algorithms

Linear Function Approximation as a Computationally Efficient Method to Solve Classical Reinforcement Learning Challenges

Policy-Iteration-Based Finite-Horizon Approximate Dynamic Programming for Continuous-Time Nonlinear Optimal Control

A policy iteration algorithm for non-Markovian control problems

Value-Gradient Iteration with Quadratic Approximate Value Functions

Approximate Policy Iteration for Robust Stochastic Control of Multi-agent Markov Decision Processes

Policy Iteration Based Approximate Dynamic Programming Toward Autonomous Driving in Constrained Dynamic Environment

Representation Policy Iteration

Twin Deterministic Policy Gradient Adaptive Dynamic Programming for Optimal Control of Affine Nonlinear Discrete-time Systems

Efficient approximate dynamic programming based on design and analysis of computer experiments for infinite-horizon optimization

Approximate Finite-Horizon Optimal Control with Policy Iteration

Actor-Critic Reinforcement Learning with Phased Actor

Approximate Modified Policy Iteration

Approximate Linear Programming for Decentralized Policy Iteration in Cooperative Multi-agent Markov Decision Processes

Approximate Midpoint Policy Iteration for Linear Quadratic Control

Hamiltonian-Driven Adaptive Dynamic Programming With Approximation Errors