Abstract: We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP). Despite the popularity of Lagrangian-based policy search methods in practice, the oscillation of policy iterates in these methods is not fully understood, which leads to issues such as constraint violation and sensitivity to hyper-parameters. To fill this gap, we employ the Lagrangian method to cast a constrained MDP into a constrained saddle-point problem in which the max/min players correspond to the primal/dual variables, respectively, and develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy. Specifically, we first propose a regularized policy gradient primal-dual (RPG-PD) method that simultaneously updates the policy using an entropy-regularized policy gradient and the dual variable via a quadratic-regularized gradient ascent. We prove that the policy primal-dual iterates of RPG-PD converge to a regularized saddle point at a sublinear rate, while the policy iterates converge sublinearly to an optimal constrained policy. We further instantiate RPG-PD in large state or action spaces by including function approximation in the policy parametrization, and establish similar sublinear last-iterate policy convergence. Second, we propose an optimistic policy gradient primal-dual (OPG-PD) method that employs the optimistic gradient method to update the primal/dual variables simultaneously. We prove that the policy primal-dual iterates of OPG-PD converge at a linear rate to a saddle point that contains an optimal constrained policy. To the best of our knowledge, this is the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
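For orientation, the saddle-point reformulation described in the abstract can be sketched as follows. The notation is an assumption introduced here for illustration and is not taken from the abstract: $V_r^{\pi}(\rho)$ and $V_g^{\pi}(\rho)$ denote the discounted reward and utility values of policy $\pi$ from initial distribution $\rho$, $b$ is a constraint threshold, $\lambda \ge 0$ is the dual variable, $\tau > 0$ is a regularization weight, and $\mathcal{H}(\pi)$ is a discounted policy entropy. The exact form of the regularizers used by RPG-PD may differ from this sketch.

```latex
% Constrained MDP (assumed notation): maximize reward subject to a utility constraint.
\max_{\pi} \; V_r^{\pi}(\rho)
\quad \text{subject to} \quad V_g^{\pi}(\rho) \ge b .

% Lagrangian and the associated saddle-point problem:
% the policy is the max player (primal), \lambda is the min player (dual).
L(\pi, \lambda) = V_r^{\pi}(\rho) + \lambda \bigl( V_g^{\pi}(\rho) - b \bigr),
\qquad
\max_{\pi} \; \min_{\lambda \ge 0} \; L(\pi, \lambda) .

% One plausible regularized Lagrangian: an entropy bonus for the policy
% and a quadratic term for the dual variable.
L_{\tau}(\pi, \lambda) = L(\pi, \lambda)
  + \tau \Bigl( \mathcal{H}(\pi) + \tfrac{1}{2} \lambda^{2} \Bigr) .
```

Under these assumptions, the entropy term rewards stochastic policies on the max side and the quadratic term makes the objective strongly convex in $\lambda$ on the min side, which is the kind of structure that allows a regularized saddle point to be unique and the iterates themselves, rather than their averages, to converge.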

Convergence Rate of Primal-Dual Approach to Constrained Reinforcement Learning with Softmax Policy

Successive Convex Approximation Based Off-Policy Optimization for Constrained Reinforcement Learning

Natural Policy Gradient Primal-Dual Method for Constrained Markov Decision Processes

A Policy Gradient Primal-Dual Algorithm for Constrained MDPs with Uniform PAC Guarantees

Policy-based Primal-Dual Methods for Concave CMDP with Variance Reduction

Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs

Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs

Learning General Parameterized Policies for Infinite Horizon Average Reward Constrained MDPs via Primal-Dual Policy Gradient Algorithm

Deterministic Policy Gradient Primal-Dual Methods for Continuous-Space Constrained MDPs

A Near-Optimal Primal-Dual Method for Off-Policy Learning in CMDP

Accelerated Primal-Dual Policy Optimization for Safe Reinforcement Learning

Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning

CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee

Global Convergence of Policy Gradient Primal-dual Methods for Risk-constrained LQRs

Achieving Zero Constraint Violation for Constrained Reinforcement Learning via Conservative Natural Policy Gradient Primal-Dual Algorithm

Double Duality: Variational Primal-Dual Policy Optimization for Constrained Reinforcement Learning

Adaptive Primal-Dual Method for Safe Reinforcement Learning

Off-Policy Primal-Dual Safe Reinforcement Learning

A Primal-Dual-Critic Algorithm for Offline Constrained Reinforcement Learning

A policy gradient approach for Finite Horizon Constrained Markov Decision Processes

Policy Gradient Methods for the Cost-Constrained LQR: Strong Duality and Global Convergence