SIAM Journal on Optimization, Volume 33, Issue 3, Pages 2341-2378, September 2023.

Abstract: In this paper, we present a new policy gradient (PG) method, namely, the block policy mirror descent (BPMD) method, for solving a class of regularized reinforcement learning (RL) problems with (strongly) convex regularizers. Compared to traditional PG methods with a batch update rule, which visit and update the policy for every state, the BPMD method has cheap per-iteration computation via a partial update rule that performs the policy update on a single sampled state. Despite the nonconvex nature of the problem and the partial update rule, we provide a unified analysis for several sampling schemes and show that BPMD achieves fast linear convergence to global optimality. In particular, uniform sampling leads to a worst-case total computational complexity comparable to that of batch PG methods. A necessary and sufficient condition for convergence with on-policy sampling is also identified. With a hybrid sampling scheme, we further show that BPMD enjoys potential instance-dependent acceleration, leading to improved dependence on the size of the state space and consequently outperforming batch PG methods. We then extend BPMD methods to the stochastic setting by utilizing stochastic first-order information constructed from samples. With a generative model, [math] (resp., [math]) sample complexities are established for strongly convex (resp., non-strongly convex) regularizers, where [math] denotes the target accuracy. To the best of our knowledge, this is the first time that block coordinate descent methods have been developed and analyzed for policy optimization in reinforcement learning, which provides a new perspective on solving large-scale RL problems.
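To make the partial update rule concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a small tabular MDP with a known transition tensor `P` and cost matrix `c`, a negative-entropy regularizer with weight `tau`, the KL divergence as the Bregman distance, uniform state sampling, and exact policy evaluation. The function names `evaluate` and `bpmd_step` and all parameter values are hypothetical choices made for this example.

```python
import numpy as np

def evaluate(P, c, pi, gamma, tau):
    """Exact regularized value and Q-function of pi (assumed cost-minimization setup)."""
    S = c.shape[0]
    # Assumed regularizer: h(pi(.|s)) = tau * sum_a pi(a|s) log pi(a|s).
    h = tau * np.sum(pi * np.log(pi), axis=1)
    c_pi = np.sum(pi * c, axis=1) + h            # regularized expected cost per state
    P_pi = np.einsum("sa,sat->st", pi, P)        # state-to-state transitions under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, c_pi)
    Q = c + gamma * P @ V                        # Q[s, a]
    return V, Q

def bpmd_step(P, c, pi, gamma, tau, eta, rng):
    """Update the policy at one uniformly sampled state; all other states are untouched."""
    s = rng.integers(c.shape[0])
    # For simplicity the full Q table is computed here; only row Q[s] is used.
    _, Q = evaluate(P, c, pi, gamma, tau)
    # Closed-form mirror descent step for the KL divergence plus entropy regularizer:
    # new(a) is proportional to pi(a|s)^(1/(1+eta*tau)) * exp(-eta*Q[s,a]/(1+eta*tau)).
    logits = (np.log(pi[s]) - eta * Q[s]) / (1.0 + eta * tau)
    logits -= logits.max()                       # numerical stability
    row = np.exp(logits)
    new_pi = pi.copy()
    new_pi[s] = row / row.sum()
    return new_pi

# Hypothetical random instance, for illustration only.
rng = np.random.default_rng(0)
S, A, gamma, tau, eta = 5, 3, 0.9, 0.5, 1.0
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)                # normalize to valid transition probabilities
c = rng.random((S, A))

pi = np.full((S, A), 1.0 / A)                    # start from the uniform policy
V_init, _ = evaluate(P, c, pi, gamma, tau)
for _ in range(200):
    pi = bpmd_step(P, c, pi, gamma, tau, eta, rng)
V_final, _ = evaluate(P, c, pi, gamma, tau)
print(V_init.mean(), V_final.mean())
```

A batch method in this setting would apply the same closed-form step to every row of `pi` in each iteration. The sketch changes a single row per iteration, which is the source of the cheaper per-iteration cost described in the abstract. A practical implementation would also avoid the full policy evaluation used here.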

Policy Mirror Descent for Regularized Reinforcement Learning: A Generalized Framework with Linear Convergence

Policy mirror descent for reinforcement learning: linear convergence, new sampling complexity, and generalized problem classes

Policy Optimization with Stochastic Mirror Descent

Block Policy Mirror Descent

Convergence Rate of Primal-Dual Approach to Constrained Reinforcement Learning with Softmax Policy

On the Convergence of Policy in Unregularized Generalized Policy Mirror Descent

Homotopic Policy Mirror Descent: Policy Convergence, Implicit Regularization, and Improved Sample Complexity

A Novel Framework for Policy Mirror Descent with General Parameterization and Linear Convergence

On the Convergence of Policy in Unregularized Policy Mirror Descent

A Unified Approach to Controlling Implicit Regularization via Mirror Descent

Linear Convergence for Natural Policy Gradient with Log-linear Policy Parametrization

Convex Regularization and Convergence of Policy Gradient Flows under Safety Constraints

Learning mirror maps in policy mirror descent

Entropy annealing for policy mirror descent in continuous time and space

Policy Mirror Descent with Lookahead

Policy Optimization over General State and Action Spaces

Sparse Q-learning with Mirror Descent

Efficient Model-Based Concave Utility Reinforcement Learning through Greedy Mirror Descent

Policy Gradient for Robust Markov Decision Processes

Stochastic Cubic-Regularized Policy Gradient Method