A Large Deviations Perspective on Policy Gradient Algorithms

Wouter Jongeneel, Daniel Kuhn, Mengmeng Li
2024-06-03
Abstract: Motivated by policy gradient methods in the context of reinforcement learning, we identify a large deviation rate function for the iterates generated by stochastic gradient descent for possibly non-convex objectives satisfying a Polyak-Łojasiewicz condition. Leveraging the contraction principle from large deviations theory, we illustrate the potential of this result by showing how convergence properties of policy gradient with a softmax parametrization and an entropy regularized objective can be naturally extended to a wide spectrum of other policy parametrizations.
Optimization and Control, Machine Learning
What problem does this paper attempt to address?
This paper studies policy gradient algorithms in reinforcement learning from a large deviations perspective. The authors identify a large deviation rate function for the iterates generated by stochastic gradient descent on possibly non-convex objectives that satisfy a Polyak-Łojasiewicz (PL) condition. Using the contraction principle from large deviations theory, they show how convergence properties of policy gradient with a softmax parametrization and an entropy-regularized objective extend naturally to a wide range of other policy parametrizations.

The paper points out that although policy gradient methods are widely used in reinforcement learning, their global convergence has been understood only relatively recently. Existing analyses usually bound suboptimality in expectation and neglect the effect of the policy parametrization on convergence behavior. The paper therefore aims to determine sharp convergence rates in a probabilistic sense and to provide a unified framework for understanding performance under different policy parametrizations.

The main contributions of the paper are:
1. A lower bound, holding with high probability, for the iterates of the softmax policy gradient algorithm with an entropy-regularized objective.
2. A large deviation rate for a family of tabular policy parametrizations, obtained via the contraction principle.
3. A large deviation upper bound for stochastic gradient descent on objectives satisfying the PL condition, which is of independent interest.

The paper also introduces the background on Markov decision processes (MDPs) and entropy regularization in reinforcement learning. It presents a preliminary exponential bound on the convergence probability of the value function and derives a large deviation upper bound for policy gradient iterates under a non-uniform Polyak-Łojasiewicz condition.
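To make the setting concrete, the following is a minimal sketch of softmax policy gradient with an entropy-regularized objective, restricted to a single-state problem (a bandit) rather than a full MDP. It is not the paper's algorithm or code: the function names, the reward vector, the temperature `tau`, and the step-size schedule are all illustrative assumptions. The objective is J(theta) = sum_a pi_a (r_a - tau log pi_a), whose maximizer is the policy proportional to exp(r / tau).

```python
import numpy as np


def softmax(theta):
    """Numerically stable softmax policy over a finite action set."""
    z = np.exp(theta - np.max(theta))
    return z / z.sum()


def exact_gradient(theta, r, tau):
    """Gradient of J(theta) = sum_a pi_a * (r_a - tau * log pi_a)."""
    pi = softmax(theta)
    q = r - tau * np.log(pi)              # entropy-adjusted rewards
    return pi * (q - pi @ q)


def sampled_gradient(theta, action, reward, tau):
    """Score-function estimate of the gradient from one sampled action."""
    pi = softmax(theta)
    score = -pi.copy()
    score[action] += 1.0                  # gradient of log pi_theta(action)
    return (reward - tau * np.log(pi[action])) * score


def run(r, tau=0.5, steps=5000, step_size=0.2,
        stochastic=False, noise_std=0.0, seed=0):
    """Gradient ascent on J, with exact or sampled gradients."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(len(r))
    for k in range(steps):
        if stochastic:
            a = rng.choice(len(r), p=softmax(theta))
            reward = r[a] + noise_std * rng.standard_normal()
            g = sampled_gradient(theta, a, reward, tau)
            eta = step_size / (1.0 + 0.01 * k)   # decaying schedule (assumed)
        else:
            g = exact_gradient(theta, r, tau)
            eta = step_size
        theta = theta + eta * g
    return theta


if __name__ == "__main__":
    r = np.array([1.0, 0.5, 0.0])         # hypothetical mean rewards
    print("exact     :", softmax(run(r)))
    print("stochastic:", softmax(run(r, stochastic=True, noise_std=0.1)))
    print("optimum   :", softmax(r / 0.5))
```

Averaging `sampled_gradient` over actions drawn from the current policy recovers `exact_gradient`, so the stochastic run is a stochastic gradient method on the same objective. The fluctuations of such iterates around the optimum are the kind of object a large deviation analysis quantifies.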
Lastly, the paper discusses how to generalize from the softmax parametrization to a wider family of parametrizations, yielding high-probability exponential convergence rates for both existing and new parametrizations.
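For reference, the standard textbook forms of the three ingredients named in this summary can be written as follows. The notation (f, \mu, X_n, a_n, I, g, J) is generic and chosen here for illustration; it is not taken from the paper. In particular, the paper works with a non-uniform PL condition, whereas the first display shows the usual uniform version.

```latex
% PL condition for a differentiable objective f with infimum f^* and some \mu > 0:
\[
  \tfrac{1}{2}\,\|\nabla f(x)\|^{2} \;\ge\; \mu\,\bigl(f(x) - f^{*}\bigr)
  \quad \text{for all } x .
\]
% Large deviation upper bound with rate function I and speed a_n \to \infty:
\[
  \limsup_{n\to\infty} \frac{1}{a_n}\,\log \mathbb{P}\bigl(X_n \in F\bigr)
  \;\le\; -\inf_{x \in F} I(x)
  \quad \text{for every closed set } F .
\]
% Contraction principle: if (X_n) satisfies a large deviation principle with
% good rate function I and g is continuous, then (g(X_n)) satisfies one with
\[
  J(y) \;=\; \inf\bigl\{\, I(x) : g(x) = y \,\bigr\}.
\]
```

Read this way, the extension to other parametrizations amounts to pushing a rate function through a continuous map g: a rate function established for the softmax iterates X_n yields the rate function J for the transformed iterates g(X_n).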