Abstract: We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP). Despite the popularity of Lagrangian-based policy search methods in practice, the oscillation of policy iterates in these methods is not fully understood, and it leads to issues such as constraint violation and sensitivity to hyper-parameters. To fill this gap, we employ the Lagrangian method to cast a constrained MDP into a constrained saddle-point problem in which the max/min players correspond to the primal/dual variables, respectively, and develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy. First, we propose a regularized policy gradient primal-dual (RPG-PD) method that simultaneously updates the policy using an entropy-regularized policy gradient and the dual variable via a quadratic-regularized gradient ascent. We prove that the policy primal-dual iterates of RPG-PD converge to a regularized saddle point at a sublinear rate, while the policy iterates converge sublinearly to an optimal constrained policy. We further instantiate RPG-PD in large state or action spaces by including function approximation in the policy parametrization, and establish similar sublinear last-iterate policy convergence. Second, we propose an optimistic policy gradient primal-dual (OPG-PD) method that employs the optimistic gradient method to update the primal and dual variables simultaneously. We prove that the policy primal-dual iterates of OPG-PD converge at a linear rate to a saddle point that contains an optimal constrained policy. To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
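The oscillation issue and the two remedies named in the abstract (regularization and optimism) can be illustrated on a toy problem. The sketch below is not the paper's RPG-PD or OPG-PD: it assumes a scalar bilinear saddle-point problem f(x, y) = x*y with a max player x and a min player y, and it uses a quadratic regularizer for both players as a stand-in for the entropy and quadratic terms. The function names, the step size `eta`, and the regularization weight `tau` are hypothetical choices made for illustration.

```python
import math


def field(x, y, tau=0.0):
    """Update directions on f(x, y) = x*y - (tau/2)*x**2 + (tau/2)*y**2.

    Returns the ascent direction for the max player x and the descent
    direction for the min player y. tau = 0 gives the plain bilinear game.
    """
    return y - tau * x, -x - tau * y


def gda(eta=0.1, tau=0.0, iters=2000, start=(1.0, 1.0)):
    """Simultaneous gradient descent-ascent (single time scale).

    Returns the final distance of the iterate to the saddle point (0, 0).
    """
    x, y = start
    for _ in range(iters):
        gx, gy = field(x, y, tau)
        x, y = x + eta * gx, y + eta * gy
    return math.hypot(x, y)


def optimistic_gda(eta=0.1, iters=2000, start=(1.0, 1.0)):
    """Optimistic gradient update on the unregularized game (tau = 0).

    Each step uses 2 * (current direction) - (previous direction).
    Returns the final distance of the iterate to the saddle point (0, 0).
    """
    x, y = start
    # Assumption: at step 0 the "previous" direction equals the current one,
    # so the first step reduces to a plain simultaneous update.
    prev_gx, prev_gy = field(x, y)
    for _ in range(iters):
        gx, gy = field(x, y)
        x, y = x + eta * (2 * gx - prev_gx), y + eta * (2 * gy - prev_gy)
        prev_gx, prev_gy = gx, gy
    return math.hypot(x, y)


if __name__ == "__main__":
    print("plain GDA distance:      ", gda(tau=0.0))
    print("regularized GDA distance:", gda(tau=0.5))
    print("optimistic GDA distance: ", optimistic_gda())
```

On this toy problem the plain simultaneous iteration spirals away from the saddle point, while the regularized and optimistic variants approach it. Note that the regularized variant converges to the saddle point of the regularized objective, which here happens to coincide with (0, 0); in general it differs from the original saddle point, which is why the abstract distinguishes the two.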

On the Convergence of Projected Policy Gradient for Any Constant Step Sizes

On the Sublinear Convergence of Projected Policy Gradient for Any Constant Step Sizes

Projected Policy Gradient Converges in a Finite Number of Iterations

Elementary Analysis of Policy Gradient Methods

On the Linear Convergence of Natural Policy Gradient Algorithm

Convergence of Policy Gradient for Stochastic Linear-Quadratic Control Problem in Infinite Horizon

Stochastic Cubic-Regularized Policy Gradient Method

Convergence of Policy Gradient Methods for Finite-Horizon Exploratory Linear-Quadratic Control Problems

Deterministic Policy Gradients with General State Transitions

Convergence Rate of Projected Subgradient Method with Time-varying Step-sizes

Convergence rate analysis of distributed optimization with projected subgradient algorithm

Accelerated Policy Gradient: On the Convergence Rates of the Nesterov Momentum for Reinforcement Learning

On the Convergence of Discounted Policy Gradient Methods

Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs

Global Convergence of Policy Gradient Methods in Reinforcement Learning, Games and Control

Model-Based Reparameterization Policy Gradient Methods: Theory and Practical Algorithms

Projective Proximal Gradient Descent for A Class of Nonconvex Nonsmooth Optimization Problems: Fast Convergence Without Kurdyka-Lojasiewicz (KL) Property

Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs

Projected Reflected Gradient Method with Larger Step Size for Monotone Variational Inequalities

Linear Convergence for Natural Policy Gradient with Log-linear Policy Parametrization

Convergence for Natural Policy Gradient on Infinite-State Queueing MDPs