Finite-Sample Analysis of Off-Policy Natural Actor–Critic With Linear Function Approximation

Zaiwei Chen, Sajad Khodadadian, Siva Theja Maguluri
DOI: https://doi.org/10.1109/lcsys.2022.3172242
2022-01-01
IEEE Control Systems Letters
Abstract: In this letter, we develop a novel variant of the natural actor-critic algorithm using off-policy sampling and linear function approximation, and establish a sample complexity of $\mathcal{O}(\epsilon^{-3})$, outperforming all previously known convergence bounds for such algorithms. To overcome the divergence caused by the deadly triad in off-policy policy evaluation under function approximation, we develop a critic that employs an $n$-step TD-learning algorithm with a properly chosen $n$. We present finite-sample convergence bounds on this critic, which are of independent interest. Furthermore, we develop a variant of natural policy gradient under function approximation, with an improved convergence rate of $\mathcal{O}(1/\mathsf{T})$ after $\mathsf{T}$ iterations. Combining the finite-sample bounds of the actor and the critic, we obtain an overall $\mathcal{O}(\epsilon^{-3})$ sample complexity. Our results are derived solely from the assumption that the behavior policy sufficiently explores the state-action space, which is a much weaker assumption than those made in the related literature.
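To make the critic's ingredients concrete, the following is a minimal sketch of the textbook semi-gradient off-policy $n$-step TD update for a linear value estimate, with importance-sampling ratios correcting for the behavior policy. It is not the critic analyzed in the letter, whose exact update is not given in the abstract. The function name `n_step_td_update`, the helper `dot`, and all parameter names are hypothetical and chosen only for illustration.

```python
from math import prod


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def n_step_td_update(w, feats, rewards, rhos, gamma, alpha):
    """One textbook off-policy n-step TD update (illustrative, not the paper's critic).

    w       : current weight vector of the linear value estimate
    feats   : n + 1 feature vectors for states s_t, ..., s_{t+n}
    rewards : n rewards observed along the trajectory segment
    rhos    : n importance ratios pi(a|s) / b(a|s) for a_t, ..., a_{t+n-1}
    """
    n = len(rewards)
    # n-step return: discounted rewards plus a bootstrapped tail value.
    g = sum(gamma ** k * r for k, r in enumerate(rewards))
    g += gamma ** n * dot(feats[n], w)
    # The product of ratios reweights the sample toward the target policy.
    rho = prod(rhos)
    # Semi-gradient step along the feature vector of the first state.
    step = alpha * rho * (g - dot(feats[0], w))
    return [wi + step * fi for wi, fi in zip(w, feats[0])]
```

Here a larger $n$ puts more weight on observed rewards and less on the bootstrapped estimate, at the cost of multiplying more importance ratios together; this sketch does not attempt to reproduce how the letter chooses $n$.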