Abstract:Actor-critic (AC) is a powerful method for learning an optimal policy in reinforcement learning, where the critic uses algorithms, e.g., temporal difference (TD) learning with function approximation, to evaluate the current policy and the actor updates the policy along an approximate gradient direction using information from the critic. This paper provides the \textit{tightest} non-asymptotic convergence bounds for both the AC and natural AC (NAC) algorithms. Specifically, existing studies show that AC converges to an $\epsilon+\varepsilon_{\text{critic}}$ neighborhood of stationary points with the best known sample complexity of $\mathcal{O}(\epsilon^{-2})$ (up to a log factor), and NAC converges to an $\epsilon+\varepsilon_{\text{critic}}+\sqrt{\varepsilon_{\text{actor}}}$ neighborhood of the global optimum with the best known sample complexity of $\mathcal{O}(\epsilon^{-3})$, where $\varepsilon_{\text{critic}}$ is the approximation error of the critic and $\varepsilon_{\text{actor}}$ is the approximation error induced by the insufficient expressive power of the parameterized policy class. This paper analyzes the convergence of both AC and NAC algorithms with compatible function approximation. Our analysis eliminates the term $\varepsilon_{\text{critic}}$ from the error bounds while still achieving the best known sample complexities. Moreover, we focus on the challenging single-loop setting with a single Markovian sample trajectory. Our major technical novelty lies in analyzing the stochastic bias due to policy-dependent and time-varying compatible function approximation in the critic, and handling the non-ergodicity of the MDP due to the single Markovian sample trajectory. Numerical results are also provided in the appendix.

Finite-Sample Analysis of Off-Policy Natural Actor–Critic With Linear Function Approximation

Non-Asymptotic Analysis for Single-Loop (Natural) Actor-Critic with Compatible Function Approximation

An Approximate Policy Iteration Viewpoint of Actor-Critic Algorithms

On the sample complexity of actor-critic method for reinforcement learning with function approximation

Two-Timescale Critic-Actor for Average Reward MDPs with Function Approximation

Improved Sample Complexity for Global Convergence of Actor-Critic Algorithms

Actor-Critic or Critic-Actor? A Tale of Two Time Scales

On the Global Convergence of Natural Actor-Critic with Two-layer Neural Network Parametrization

Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality

High-probability sample complexities for policy evaluation with linear function approximation

Neural Network Compatible Off-Policy Natural Actor-Critic Algorithm

Compatible Gradient Approximations for Actor-Critic Algorithms

Finite-Time Analysis of Three-Timescale Constrained Actor-Critic and Constrained Natural Actor-Critic Algorithms

Linear Function Approximation as a Computationally Efficient Method to Solve Classical Reinforcement Learning Challenges

Sample and Communication-Efficient Decentralized Actor-Critic Algorithms with Finite-Time Analysis

Infinite-Horizon Offline Reinforcement Learning with Linear Function Approximation: Curse of Dimensionality and Algorithm

Natural Actor-Critic for Robust Reinforcement Learning with Function Approximation

Reinforcement Learning from Partial Observation: Linear Function Approximation with Provable Sample Efficiency

Finite-Time Complexity of Online Primal-Dual Natural Actor-Critic Algorithm for Constrained Markov Decision Processes

Stochastic Convergence Results for Regularized Actor-Critic Methods

Weak Convergence Analysis of Online Neural Actor-Critic Algorithms