Block Policy Mirror Descent
Guanghui Lan, Yan Li, Tuo Zhao
DOI: https://doi.org/10.1137/22M1480409
IF: 2.763
2023-08-30
SIAM Journal on Optimization
Abstract: SIAM Journal on Optimization, Volume 33, Issue 3, Pages 2341-2378, September 2023. In this paper, we present a new policy gradient (PG) method, namely, the block policy mirror descent (BPMD) method, for solving a class of regularized reinforcement learning (RL) problems with (strongly) convex regularizers. Compared to traditional PG methods with a batch update rule, which visit and update the policy for every state, the BPMD method has cheap per-iteration computation via a partial update rule that performs the policy update only on a sampled state. Despite the nonconvex nature of the problem and the partial update rule, we provide a unified analysis for several sampling schemes and show that BPMD achieves fast linear convergence to global optimality. In particular, uniform sampling leads to a worst-case total computational complexity comparable to that of batch PG methods. A necessary and sufficient condition for convergence with on-policy sampling is also identified. With a hybrid sampling scheme, we further show that BPMD enjoys potential instance-dependent acceleration, leading to improved dependence on the size of the state space and consequently outperforming batch PG methods. We then extend BPMD methods to the stochastic setting by utilizing stochastic first-order information constructed from samples. With a generative model, [math] (resp., [math]) sample complexities are established for the strongly convex (resp., non-strongly convex) regularizers, where [math] denotes the target accuracy. To the best of our knowledge, this is the first time that block coordinate descent methods have been developed and analyzed for policy optimization in reinforcement learning, which provides a new perspective on solving large-scale RL problems.
mathematics, applied
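
The partial update rule described in the abstract can be illustrated with a minimal tabular sketch. It is not taken from the paper, and several choices are assumptions made for illustration: the regularizer is negative entropy with weight `tau`, the mirror-descent divergence is the KL divergence (which gives the closed-form update used below), the sampling scheme is uniform, and the policy is evaluated exactly with a linear solve rather than from samples. The function names `evaluate` and `bpmd` are hypothetical. For clarity the sketch computes Q for all states, although only the sampled state's row is needed by the update.

```python
import numpy as np


def evaluate(P, c, pi, gamma, tau):
    """Exact regularized policy evaluation for a tabular MDP (cost minimization).

    P: (S, A, S) transition probabilities, c: (S, A) costs, pi: (S, A) policy.
    Assumed regularizer: tau * sum_a pi(a|s) * log pi(a|s) (negative entropy).
    """
    n_states = c.shape[0]
    h = np.sum(pi * np.log(pi), axis=1)        # negative entropy per state
    c_pi = np.sum(pi * c, axis=1) + tau * h    # expected regularized cost
    P_pi = np.einsum("sa,sat->st", pi, P)      # state-to-state transitions under pi
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, c_pi)
    # Q omits the current-state regularizer, which is constant across actions
    # and therefore does not change the block update below.
    Q = c + gamma * (P @ V)
    return V, Q


def bpmd(P, c, gamma=0.9, tau=0.1, eta=1.0, n_iters=2000, seed=0):
    """Block update sketch: one uniformly sampled state is updated per iteration."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = c.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform initial policy
    for _ in range(n_iters):
        _, Q = evaluate(P, c, pi, gamma, tau)
        s = rng.integers(n_states)             # uniform state sampling
        # Closed form of
        #   argmin_p  eta * (<Q[s], p> + tau * sum_a p log p) + KL(p || pi[s])
        # over the probability simplex.
        logits = (np.log(pi[s]) - eta * Q[s]) / (1.0 + eta * tau)
        logits -= logits.max()                 # numerical stability
        p = np.exp(logits)
        pi[s] = p / p.sum()                    # all other states are left unchanged
    return pi
```

Calling `bpmd(P, c)` on a small random MDP and passing the result to `evaluate` gives the regularized value of the learned policy, which can be compared against a soft value iteration baseline. Replacing the uniform draw of `s` with a different sampling distribution is the natural place to experiment with the other sampling schemes the abstract mentions.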