Abstract: Motivated by policy gradient methods in the context of reinforcement learning, we identify a large deviation rate function for the iterates generated by stochastic gradient descent for possibly non-convex objectives satisfying a Polyak-Łojasiewicz condition. Leveraging the contraction principle from large deviations theory, we illustrate the potential of this result by showing how convergence properties of policy gradient with a softmax parametrization and an entropy regularized objective can be naturally extended to a wide spectrum of other policy parametrizations.
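
For readers unfamiliar with the two tools named in the abstract, their standard textbook forms are sketched below. The symbols (f, f^*, \mu, a_n, I, g) are generic notation assumed here for illustration, not necessarily the paper's own, and the paper's precise assumptions and normalization may differ.

```latex
% Polyak-Lojasiewicz (PL) condition for a differentiable f with infimum f^* (minimization form):
\frac{1}{2}\,\|\nabla f(x)\|^2 \;\ge\; \mu\,\bigl(f(x) - f^*\bigr)
\quad \text{for all } x, \text{ for some } \mu > 0.

% Large deviation upper bound with rate function I and speed a_n \to \infty:
\limsup_{n \to \infty} \frac{1}{a_n} \log \mathbb{P}(X_n \in F) \;\le\; -\inf_{x \in F} I(x)
\quad \text{for every closed set } F.

% Contraction principle: if (X_n) satisfies a large deviation principle with good
% rate function I and g is continuous, then (g(X_n)) satisfies one with rate function
\tilde{I}(y) \;=\; \inf\{\, I(x) : g(x) = y \,\}.
```

Read this way, a rate function established for one parametrization can be pushed through a continuous map g to obtain a rate function for the mapped iterates.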

What problem does this paper attempt to address?

This paper takes a large deviations perspective on policy gradient algorithms in reinforcement learning. The authors identify a large deviation rate function for the iterates generated by stochastic gradient descent on possibly non-convex objectives that satisfy the Polyak-Łojasiewicz (PL) condition. Using the contraction principle from large deviations theory, they show how the convergence properties of policy gradient with a softmax parameterization and an entropy-regularized objective extend naturally to a wider range of other policy parameterizations.

The paper points out that although policy gradient methods are widely used in reinforcement learning, their global convergence has only recently begun to be understood. Existing analyses usually focus on suboptimality in expectation and neglect the impact of different parameterizations on convergence behavior. The paper therefore aims to determine sharp convergence rates in a probabilistic sense and to provide a unified framework for understanding performance under different policy parameterizations.

The main contributions of the paper are:

1. A lower bound, holding with high probability, for the iterates of the softmax policy gradient algorithm with an entropy-regularized objective.
2. A large deviation rate for a family of tabular policy parameterizations, obtained via the contraction principle.
3. A large deviation upper bound for stochastic gradient descent on objectives satisfying the PL condition, which is of independent interest.

The paper also introduces background on Markov Decision Processes (MDPs) and entropy regularization in reinforcement learning. It presents a preliminary probabilistic bound showing exponential convergence for the value function, and derives a large deviation upper bound that applies to policy gradient iterates under a non-uniform Polyak-Łojasiewicz condition. Lastly, the paper discusses generalizing from the softmax parameterization to a wider family of parameterizations, yielding high-probability exponential convergence rates for both existing and new parameterizations.
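
To make the setting concrete, here is a minimal sketch of stochastic softmax policy gradient with an entropy-regularized objective on a single-state problem (a bandit). Everything specific in it is an assumption chosen for illustration and is not taken from the paper: the reward vector, the temperature `tau`, the Gaussian reward noise, the decaying step-size schedule, and the helper names. It estimates how often the final suboptimality gap exceeds a threshold `eps`, which is the kind of tail probability that high-probability and large deviation bounds control.

```python
import math
import random


def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def regularized_value(pi, rewards, tau):
    """J(pi) = sum_a pi_a * (r_a - tau * log pi_a): expected reward plus tau * entropy."""
    return sum(p * (r - tau * math.log(p)) for p, r in zip(pi, rewards))


def optimal_value(rewards, tau):
    """Closed-form maximum of J over the simplex: tau * log sum_a exp(r_a / tau)."""
    m = max(rewards)
    return m + tau * math.log(sum(math.exp((r - m) / tau) for r in rewards))


def run_sgd(rewards, tau, steps, seed, eta0=0.2, decay=200.0, noise_std=0.1):
    """Stochastic softmax policy gradient ascent; returns the final suboptimality gap."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)
    for t in range(steps):
        pi = softmax(theta)
        # Sample an action from the current policy by inverting the CDF.
        u, acc, a = rng.random(), 0.0, len(pi) - 1
        for i, p in enumerate(pi):
            acc += p
            if u < acc:
                a = i
                break
        # Noisy reward plus the entropy bonus of the sampled action.
        signal = rewards[a] + rng.gauss(0.0, noise_std) - tau * math.log(pi[a])
        eta = eta0 / (1.0 + t / decay)  # hypothetical decaying step size
        # Score-function (REINFORCE-style) estimate: signal * grad_theta log pi(a).
        for b in range(len(theta)):
            theta[b] += eta * signal * ((1.0 if b == a else 0.0) - pi[b])
    return optimal_value(rewards, tau) - regularized_value(softmax(theta), rewards, tau)


def tail_frequency(rewards, tau, steps, eps, runs=20):
    """Fraction of independent runs whose final gap exceeds eps."""
    gaps = [run_sgd(rewards, tau, steps, seed) for seed in range(runs)]
    return sum(g > eps for g in gaps) / runs


if __name__ == "__main__":
    r, tau = [1.0, 0.5, 0.0], 0.5
    for n in (100, 1000, 5000):
        print(n, tail_frequency(r, tau, n, eps=0.01))
```

The gradient estimate is unbiased for the exact gradient of J with respect to the logits, so the empirical tail frequency can be compared across iteration counts. This is only an empirical illustration and does not reproduce the paper's rate function.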

A Large Deviations Perspective on Policy Gradient Algorithms

Elementary Analysis of Policy Gradient Methods

Enhancing Policy Gradient with the Polyak Step-Size Adaption

When Do Off-Policy and On-Policy Policy Gradient Methods Align?

Stochastic Cubic-Regularized Policy Gradient Method

Beyond Exact Gradients: Convergence of Stochastic Soft-Max Policy Gradient Methods with Entropy Regularization

Towards Principled, Practical Policy Gradient for Bandits and Tabular MDPs

Behind the Myth of Exploration in Policy Gradients

Variational Policy Gradient Method for Reinforcement Learning with General Utilities

Mollification Effects of Policy Gradient Methods

Policy Gradient Algorithms Implicitly Optimize by Continuation

Policy Gradient for Robust Markov Decision Processes

Learning Optimal Deterministic Policies with Stochastic Policy Gradients

Strongly-Polynomial Time and Validation Analysis of Policy Gradient Methods

A Payoff-Based Policy Gradient Method in Stochastic Games with Long-Run Average Payoffs

The Definitive Guide to Policy Gradients in Deep Reinforcement Learning: Theory, Algorithms and Implementations

Global Convergence of Policy Gradient Methods in Reinforcement Learning, Games and Control

Increasing Entropy to Boost Policy Gradient Performance on Personalization Tasks

Policy Gradient in Robust MDPs with Global Convergence Guarantee

Policy Gradient Method For Robust Reinforcement Learning