Abstract: SIAM Journal on Optimization, Volume 33, Issue 3, Pages 2341-2378, September 2023. In this paper, we present a new policy gradient (PG) method, the block policy mirror descent (BPMD) method, for solving a class of regularized reinforcement learning (RL) problems with (strongly) convex regularizers. Compared to traditional PG methods with a batch update rule, which visit and update the policy for every state, the BPMD method has cheap per-iteration computation via a partial update rule that performs the policy update on a sampled state. Despite the nonconvex nature of the problem and the partial update rule, we provide a unified analysis for several sampling schemes and show that BPMD achieves fast linear convergence to global optimality. In particular, uniform sampling leads to a worst-case total computational complexity comparable to that of batch PG methods. A necessary and sufficient condition for convergence with on-policy sampling is also identified. With a hybrid sampling scheme, we further show that BPMD enjoys potential instance-dependent acceleration, leading to improved dependence on the size of the state space and consequently outperforming batch PG methods. We then extend BPMD methods to the stochastic setting by utilizing stochastic first-order information constructed from samples. With a generative model, [math] (resp., [math]) sample complexities are established for strongly convex (resp., non-strongly convex) regularizers, where [math] denotes the target accuracy. To the best of our knowledge, this is the first time that block coordinate descent methods have been developed and analyzed for policy optimization in reinforcement learning, which provides a new perspective on solving large-scale RL problems.
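The partial update rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a small tabular MDP with costs to be minimized, a negative-entropy regularizer weighted by `tau`, a KL-divergence mirror step with step size `eta`, exact policy evaluation, and uniform state sampling. The function names `evaluate_q` and `bpmd_step` and all parameter names are hypothetical.

```python
import numpy as np


def evaluate_q(P, c, pi, gamma, tau):
    """Exact regularized Q and V of policy pi on a tabular MDP (illustrative).

    P: transitions, shape (S, A, S); c: costs, shape (S, A);
    pi: strictly positive policy, shape (S, A).
    """
    S, _ = c.shape
    # Assumed regularizer: tau-weighted negative entropy of pi(.|s).
    h = tau * np.sum(pi * np.log(pi), axis=1)
    # Expected regularized one-step cost and state transitions under pi.
    c_pi = np.sum(pi * c, axis=1) + h
    P_pi = np.einsum("sa,sat->st", pi, P)
    # Solve the linear Bellman equation (I - gamma * P_pi) V = c_pi.
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, c_pi)
    Q = c + h[:, None] + gamma * (P @ V)
    return Q, V


def bpmd_step(P, c, pi, gamma, tau, eta, rng):
    """One block update: only the row of a single sampled state changes."""
    # For simplicity Q is computed for all states here; only Q[s] is used.
    Q, _ = evaluate_q(P, c, pi, gamma, tau)
    s = rng.integers(c.shape[0])  # uniform sampling over states (assumption)
    # Closed-form minimizer of eta * (<Q[s], p> + tau * sum(p log p)) + KL(p, pi[s]).
    logits = (np.log(pi[s]) - eta * Q[s]) / (1.0 + eta * tau)
    logits -= logits.max()  # shift for numerical stability
    row = np.exp(logits)
    new_pi = pi.copy()
    new_pi[s] = row / row.sum()
    return new_pi, s
```

Starting from a uniform policy and calling `bpmd_step` repeatedly with a `numpy.random.Generator` updates one state's action distribution per iteration, in contrast to a batch update that would recompute every row.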

Stochastic Cubic-Regularized Policy Gradient Method

Sample Complexity of Policy Gradient Finding Second-Order Stationary Points

Successive Convex Approximation Based Off-Policy Optimization for Constrained Reinforcement Learning

A Cubic-regularized Policy Newton Algorithm for Reinforcement Learning

Policy Gradient Method For Robust Reinforcement Learning

Learning Optimal Deterministic Policies with Stochastic Policy Gradients

Elementary Analysis of Policy Gradient Methods

Stochastic Policy Gradient Methods: Improved Sample Complexity for Fisher-non-degenerate Policies

CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee

Bridging the Gap between Newton-Raphson Method and Regularized Policy Iteration

Asynchronous Parallel Policy Gradient Methods for the Linear Quadratic Regulator

Stochastic Recursive Momentum for Policy Gradient Methods

A Single-Loop Robust Policy Gradient Method for Robust Markov Decision Processes

Variational Policy Gradient Method for Reinforcement Learning with General Utilities

Policy Gradient Converges to the Globally Optimal Policy for Nearly Linear-Quadratic Regulators

Regularly Updated Deterministic Policy Gradient Algorithm

A Payoff-Based Policy Gradient Method in Stochastic Games with Long-Run Average Payoffs

Fast Policy Learning for Linear Quadratic Control with Entropy Regularization

Stochastic Variance-Reduced Policy Gradient

Beyond Exact Gradients: Convergence of Stochastic Soft-Max Policy Gradient Methods with Entropy Regularization

Block Policy Mirror Descent