Actor-Critic or Critic-Actor? A Tale of Two Time Scales

Shalabh Bhatnagar, Vivek S. Borkar, Soumyajit Guin
DOI: https://doi.org/10.1109/LCSYS.2023.3288931
2024-06-12
Abstract: We revisit the standard formulation of tabular actor-critic algorithm as a two time-scale stochastic approximation with value function computed on a faster time-scale and policy computed on a slower time-scale. This emulates policy iteration. We observe that reversal of the time scales will in fact emulate value iteration and is a legitimate algorithm. We provide a proof of convergence and compare the two empirically with and without function approximation (with both linear and nonlinear function approximators) and observe that our proposed critic-actor algorithm performs on par with actor-critic in terms of both accuracy and computational effort.
Machine Learning
What problem does this paper attempt to address?
The main question this paper addresses is whether reversing the time scales of the Actor-Critic algorithm in reinforcement learning (so that the Critic runs on the slower time scale and the Actor on the faster one) still yields a valid algorithm, and how that algorithm performs. Specifically:

1. **Standard Actor-Critic algorithm**: Traditionally, the Actor-Critic algorithm updates on two time scales: the value function (Critic) is updated on the faster time scale and the policy (Actor) on the slower one. This setting emulates Policy Iteration and has been widely studied and applied.
2. **Proposing the Critic-Actor algorithm**: The paper proposes a new algorithm, Critic-Actor, which reverses the time scales: the Critic is updated on the slower time scale and the Actor on the faster one. This setting emulates Value Iteration. The authors aim to verify that this new algorithm is valid and to compare it with the traditional Actor-Critic algorithm.
3. **Theoretical proof and experimental verification**: The authors provide a proof of convergence for the Critic-Actor algorithm and compare the two algorithms empirically, both in the tabular setting and with function approximation (linear and nonlinear). The experimental results show that Critic-Actor is comparable to Actor-Critic in terms of accuracy and computational effort, and even performs slightly better in some cases.
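The time-scale reversal described above comes down to which recursion gets the asymptotically smaller step size. The sketch below is only an illustration: the function names and the particular schedules `1/(n+1)` and `1/(n+1)**0.6` are assumptions, not values taken from the paper. It assumes the usual two-time-scale condition that the slower recursion's step size is negligible relative to the faster one, which for Critic-Actor means the critic's step divided by the actor's step tends to zero.

```python
def critic_step(n: int) -> float:
    """Illustrative slower step size a(n) for the critic (assumed schedule)."""
    return 1.0 / (n + 1)


def actor_step(n: int) -> float:
    """Illustrative faster step size b(n) for the actor (assumed schedule)."""
    return 1.0 / (n + 1) ** 0.6


# The ratio a(n) / b(n) = (n + 1) ** -0.4 shrinks toward zero, so the critic
# moves on the slower time scale and the actor on the faster one.
ratios = [critic_step(n) / actor_step(n) for n in (1, 10, 100, 1000, 10000)]
for n, r in zip((1, 10, 100, 1000, 10000), ratios):
    print(f"n={n:>5}  a(n)/b(n)={r:.4f}")
```

Swapping the two schedules would recover the standard Actor-Critic ordering, in which the critic is the faster recursion.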
### Formula summary

- **Value function update formula** (the Critic part in the Critic-Actor algorithm):
\[ V_{n + 1}(i)=V_n(i)+a(\nu_1(i, n))\left[g(i, \phi_n(i), \xi_n(i, \phi_n(i)))+ \gamma V_n(\xi_n(i, \phi_n(i)))-V_n(i)\right]I\{Y_n = i\} \]
- **Policy update formula** (the Actor part in the Critic-Actor algorithm):
\[ \theta_{n + 1}(i, a)=\Gamma_{\theta_0}\left(\theta_n(i, a)+b(\nu_2(i, a, n))\left[V_n(i)-g(i, a, \eta_n(i, a))-\gamma V_n(\eta_n(i, a))\right]I\{Z_n=(i, a)\}\right) \]
- **Bellman equation**:
\[ V^*(i)=\min_{a\in U(i)}\sum_{j\in S}p(i, a, j)\left(g(i, a, j)+\gamma V^*(j)\right) \]

Through these formulas and theoretical analysis, the paper demonstrates the rationality and effectiveness of the Critic-Actor algorithm and provides a basis for further research.
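The update formulas above can be sketched in code for a small tabular problem. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a softmax (Gibbs) policy over the parameters θ, takes the projection Γ to be clipping onto [-θ0, θ0], and updates every state and state-action pair at every iteration, so the indicators are always 1 and the local clocks ν1, ν2 reduce to n. The names `value_iteration`, `softmax_policy`, and `critic_actor_step`, the random MDP, and the step-size schedules are all hypothetical choices made for illustration.

```python
import numpy as np


def value_iteration(P, g, gamma, tol=1e-10, max_iter=10_000):
    """Iterate V(i) <- min_a sum_j P[i,a,j] * (g[i,a,j] + gamma * V(j))."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iter):
        Q = np.einsum("iaj,iaj->ia", P, g + gamma * V[None, None, :])
        V_new = Q.min(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V


def softmax_policy(theta):
    """Assumed Gibbs policy: pi(i, a) proportional to exp(theta[i, a])."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def critic_actor_step(V, theta, P, g, gamma, a_n, b_n, theta0, rng):
    """One synchronous sweep of the critic and actor updates (simplified)."""
    S, A = theta.shape
    pi = softmax_policy(theta)
    V_new, theta_new = V.copy(), theta.copy()
    for i in range(S):
        # Critic: sample an action from the current policy, then a next state.
        act = rng.choice(A, p=pi[i])
        j = rng.choice(S, p=P[i, act])
        V_new[i] = V[i] + a_n * (g[i, act, j] + gamma * V[j] - V[i])
        # Actor: for each action, sample a next state and compare with V(i).
        for a in range(A):
            k = rng.choice(S, p=P[i, a])
            theta_new[i, a] = theta[i, a] + b_n * (V[i] - g[i, a, k] - gamma * V[k])
    # Projection, assumed here to be clipping onto [-theta0, theta0].
    return V_new, np.clip(theta_new, -theta0, theta0)


# Hypothetical random MDP with costs in [0, 1]; all sizes are illustrative.
rng = np.random.default_rng(0)
S, A, gamma, theta0 = 3, 2, 0.9, 5.0
P = rng.dirichlet(np.ones(S), size=(S, A))   # shape (S, A, S), rows sum to 1
g = rng.uniform(0.0, 1.0, size=(S, A, S))

V, theta = np.zeros(S), np.zeros((S, A))
for n in range(500):
    a_n = 1.0 / (n + 1)           # slower critic step (assumed schedule)
    b_n = 1.0 / (n + 1) ** 0.6    # faster actor step (assumed schedule)
    V, theta = critic_actor_step(V, theta, P, g, gamma, a_n, b_n, theta0, rng)

print("critic-actor V after 500 sweeps:", V)
print("value iteration V*:", value_iteration(P, g, gamma))
```

The final two lines print the Critic-Actor estimate next to the Bellman solution for a rough side-by-side look. A run this short with these step sizes should not be expected to match closely; the paper's convergence result is asymptotic.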