Abstract: We revisit the standard formulation of the tabular actor-critic algorithm as a two time-scale stochastic approximation, with the value function computed on a faster time-scale and the policy computed on a slower time-scale. This emulates policy iteration. We observe that reversal of the time scales will in fact emulate value iteration and is a legitimate algorithm. We provide a proof of convergence and compare the two empirically with and without function approximation (with both linear and nonlinear function approximators) and observe that our proposed critic-actor algorithm performs on par with actor-critic in terms of both accuracy and computational effort.

What problem does this paper attempt to address?

The paper asks whether reversing the time scales of the actor-critic algorithm in reinforcement learning (so that the critic runs on the slower time scale and the actor on the faster one) still yields a valid algorithm, and how that algorithm performs. Specifically:

1. **Standard actor-critic algorithm**: Traditionally, the actor-critic algorithm updates on two time scales: the value function (critic) is updated on the faster time scale and the policy (actor) on the slower one. This setting emulates policy iteration and has been widely studied and applied.
2. **Proposed critic-actor algorithm**: The paper proposes the critic-actor algorithm, which reverses the time scales: the critic is updated on the slower time scale and the actor on the faster one. This setting emulates value iteration. The authors aim to show that this algorithm is legitimate and to compare it with the traditional actor-critic algorithm.
3. **Theoretical proof and experimental verification**: The authors prove that the critic-actor algorithm converges and compare the two algorithms in a series of experiments, covering both the tabular setting and function approximation. The experimental results show that critic-actor is comparable to actor-critic in accuracy and computational efficiency, and even performs slightly better in some cases.
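The notion of a "slower" and a "faster" time scale can be made concrete with step-size sequences. The sketch below is only an illustration: the names `critic_step` and `actor_step` and the exponents 1 and 0.6 are assumptions chosen for this example, not the schedules used in the paper. It shows a critic step size a(n) that becomes negligible relative to the actor step size b(n), which is what places the critic on the slower time scale in a critic-actor scheme.

```python
def critic_step(n: int) -> float:
    """Hypothetical critic step size a(n); decays faster, so the critic moves slower."""
    return 1.0 / (n + 1)


def actor_step(n: int) -> float:
    """Hypothetical actor step size b(n); decays slower, so the actor moves faster."""
    return 1.0 / (n + 1) ** 0.6


# The ratio a(n)/b(n) = (n + 1)^(-0.4) shrinks toward zero as n grows,
# so critic updates become vanishingly small compared with actor updates.
ns = [1, 10, 100, 1_000, 10_000]
ratios = [critic_step(n) / actor_step(n) for n in ns]

for n, r in zip(ns, ratios):
    print(f"n={n:>6d}  a(n)/b(n)={r:.4f}")
```

Swapping the two schedules would give the usual actor-critic ordering, with the critic on the faster time scale.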
### Formula summary

- **Value function update** (the critic part of the critic-actor algorithm):
  \[ V_{n + 1}(i)=V_n(i)+a(\nu_1(i, n))\left[g(i, \phi_n(i), \xi_n(i, \phi_n(i)))+ \gamma V_n(\xi_n(i, \phi_n(i)))-V_n(i)\right]I\{Y_n = i\} \]
- **Policy update** (the actor part of the critic-actor algorithm):
  \[ \theta_{n + 1}(i, a)=\Gamma_{\theta_0}\left(\theta_n(i, a)+b(\nu_2(i, a, n))\left[V_n(i)-g(i, a, \eta_n(i, a))-\gamma V_n(\eta_n(i, a))\right]I\{Z_n=(i, a)\}\right) \]
- **Bellman equation**:
  \[ V^*(i)=\min_{a\in U(i)}\sum_{j\in S}p(i, a, j)\left(g(i, a, j)+\gamma V^*(j)\right) \]

Through these formulas and the accompanying theoretical analysis, the paper argues that the critic-actor algorithm is sound and effective, and it provides a basis for further research.
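The two updates and the Bellman equation above can be tried out on a small random MDP. The following is a minimal sketch under stated assumptions, not the authors' implementation. The summary does not define every symbol, so the code reads them as follows: \(\phi_n(i)\) is an action sampled from the current policy, \(\xi_n\) and \(\eta_n\) are sampled next states, \(Y_n\) and \(Z_n\) are the state and state-action pair updated at step n (drawn uniformly here), and \(\nu_1, \nu_2\) are local update counters. The softmax policy, the clipping used for \(\Gamma_{\theta_0}\), the step sizes, and the names `value_iteration` and `critic_actor` are likewise assumptions made for illustration.

```python
import numpy as np


def value_iteration(P, g, gamma, tol=1e-10):
    """Reference solver for V*(i) = min_a sum_j p(i,a,j) (g(i,a,j) + gamma V*(j))."""
    V = np.zeros(P.shape[0])
    while True:
        Q = np.einsum("iaj,iaj->ia", P, g + gamma * V[None, None, :])
        V_new = Q.min(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new


def critic_actor(P, g, gamma, n_iters=20_000, theta0=5.0, seed=0):
    """Tabular critic-actor sketch: slow critic step a(.), fast actor step b(.)."""
    rng = np.random.default_rng(seed)
    S, A, _ = P.shape
    V = np.zeros(S)
    theta = np.zeros((S, A))
    nu1 = np.zeros(S, dtype=int)        # nu_1: number of critic updates per state
    nu2 = np.zeros((S, A), dtype=int)   # nu_2: number of actor updates per pair

    def a(k):   # assumed critic step size (slower time scale)
        return 1.0 / (k + 1)

    def b(k):   # assumed actor step size (faster time scale)
        return 1.0 / (k + 1) ** 0.6

    for _ in range(n_iters):
        # Critic sample: Y_n = i, action phi_n(i) from an assumed softmax policy.
        i = rng.integers(S)
        pi = np.exp(theta[i] - theta[i].max())
        pi /= pi.sum()
        act = rng.choice(A, p=pi)
        j = rng.choice(S, p=P[i, act])          # xi_n(i, phi_n(i))
        critic_inc = a(nu1[i]) * (g[i, act, j] + gamma * V[j] - V[i])

        # Actor sample: Z_n = (s, u), next state eta_n(s, u).
        s, u = rng.integers(S), rng.integers(A)
        k = rng.choice(S, p=P[s, u])
        actor_inc = b(nu2[s, u]) * (V[s] - g[s, u, k] - gamma * V[k])

        # Both increments were computed from V_n; now apply them.
        V[i] += critic_inc
        nu1[i] += 1
        # Gamma_{theta_0} is assumed to be clipping onto [-theta0, theta0].
        theta[s, u] = np.clip(theta[s, u] + actor_inc, -theta0, theta0)
        nu2[s, u] += 1
    return V, theta


# Small random MDP with costs in [0, 1]; sizes and seed are arbitrary.
rng = np.random.default_rng(1)
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))      # shape (S, A, S)
g = rng.uniform(0.0, 1.0, size=(S, A, S))

V_star = value_iteration(P, g, gamma)
V, theta = critic_actor(P, g, gamma)
print("V* (value iteration):", np.round(V_star, 3))
print("V  (critic-actor):   ", np.round(V, 3))
```

Because the clipped softmax policy never becomes fully deterministic and the run is short, the learned `V` should only be expected to approach `V_star` approximately; the value-iteration output serves as a reference point rather than a guarantee about the stochastic iterates.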

Actor-Critic or Critic-Actor? A Tale of Two Time Scales
