Abstract: We study a primal-dual (PD) reinforcement learning (RL) algorithm for online constrained Markov decision processes (CMDPs). Despite its widespread practical use, the existing theoretical literature on PD-RL algorithms for this problem only provides sublinear regret guarantees and fails to ensure convergence to optimal policies. In this paper, we introduce a novel policy gradient PD algorithm with uniform probably approximate correctness (Uniform-PAC) guarantees, simultaneously ensuring convergence to optimal policies, sublinear regret, and polynomial sample complexity for any target accuracy. Notably, this represents the first Uniform-PAC algorithm for the online CMDP problem. In addition to the theoretical guarantees, we empirically demonstrate in a simple CMDP that our algorithm converges to optimal policies, while baseline algorithms exhibit oscillatory performance and constraint violation.
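The abstract relies on the Uniform-PAC notion without stating it. As a rough reminder, the standard definition can be written as below. The symbols are assumptions introduced here for illustration, not taken from the paper: \Delta_k is the optimality gap of the policy played in episode k (in the CMDP setting it is assumed to also account for constraint violation), \delta is the failure probability, and F is a function polynomial in 1/\varepsilon and \log(1/\delta) as well as in the problem parameters.

```latex
% N_eps counts the episodes whose policy is more than eps-suboptimal.
N_\varepsilon = \sum_{k=1}^{\infty} \mathbb{1}\{\Delta_k > \varepsilon\},
\qquad
\Pr\Bigl(\exists\, \varepsilon > 0 :\; N_\varepsilon > F\bigl(1/\varepsilon, \log(1/\delta)\bigr)\Bigr) \le \delta .
```

Because the bound holds for every accuracy level \varepsilon at once, the number of \varepsilon-suboptimal episodes is finite for each \varepsilon, which is what ties this single guarantee to convergence, sublinear regret, and polynomial sample complexity.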

What problem does this paper attempt to address?

This paper addresses the lack of strong theoretical guarantees for existing primal-dual (PD) reinforcement learning (RL) algorithms in online constrained Markov decision processes (CMDPs). Specifically, existing PD-RL algorithms only provide sublinear regret guarantees and cannot ensure convergence to the optimal policy. In practice they may also exhibit performance oscillation and constraint violation, which is unacceptable in applications that require stability and safety, such as autonomous driving and thermal power plant control.

### Main problems in the paper

1. **Insufficient theoretical guarantees**: Existing PD-RL algorithms only provide sublinear regret guarantees and cannot ensure that the final policy iterate performs close to the optimal policy.
2. **Lack of convergence**: Existing algorithms fail to ensure convergence to the optimal policy, which may lead to oscillatory behavior and constraint violations.
3. **Stability problems in practical applications**: Without strict convergence guarantees, existing algorithms may compromise the stability and safety of the system they control.

### Solutions in the paper

To solve these problems, the paper proposes a new policy-gradient-based primal-dual algorithm, UOpt-RPGPD (Uniform-PAC Optimistic Regularized Policy Gradient Primal-Dual), built on three key techniques:

1. **Regularized Lagrangian function**: Regularization terms on the policy entropy and on the Lagrange multipliers stabilize the primal-dual updates, so that the iterates approach the optimal solution instead of oscillating.
2. **Uniform-PAC exploration reward**: A specially designed exploration reward keeps exploration efficient even as the number of iterations grows without bound.
3. **Adjusting regularization coefficients and learning rates**: Dynamically adjusting the regularization coefficients and learning rates removes the bias that regularization introduces, so that the algorithm converges to the optimal policy.

### Theoretical contributions

UOpt-RPGPD is the first algorithm to achieve a Uniform-PAC guarantee for the online CMDP problem. This guarantee simultaneously implies convergence to optimal policies, sublinear regret, and polynomial sample complexity for any target accuracy.

### Experimental verification

In a simple CMDP environment, UOpt-RPGPD converges to the optimal policy, while the baseline algorithms exhibit oscillatory behavior or constraint violations. This supports the effectiveness of the three proposed techniques.

In conclusion, UOpt-RPGPD addresses the gaps in theoretical guarantees and practical behavior of existing PD-RL algorithms and offers a more robust approach to online CMDP problems.
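To make the regularized Lagrangian idea summarized above more concrete, here is a minimal sketch on a single-state toy problem. It is not the paper's UOpt-RPGPD algorithm: the Lagrangian form (entropy bonus plus a quadratic penalty on the multiplier), the update rule, and every name and parameter (`r`, `u`, `b`, `tau`, `eta`, `lam_max`) are assumptions chosen for illustration. There is no exploration reward and no schedule for the coefficients.

```python
import math


def entropy(pi):
    """Shannon entropy of a discrete distribution (0 * log 0 treated as 0)."""
    return -sum(p * math.log(p) for p in pi if p > 0.0)


def regularized_lagrangian(pi, lam, r, u, b, tau):
    """Assumed form: <pi, r> + lam * (<pi, u> - b) + tau * H(pi) + (tau / 2) * lam^2."""
    reward = sum(p * x for p, x in zip(pi, r))
    slack = sum(p * x for p, x in zip(pi, u)) - b  # >= 0 means the constraint holds
    return reward + lam * slack + tau * entropy(pi) + 0.5 * tau * lam ** 2


def primal_dual_step(pi, lam, r, u, b, tau, eta, lam_max):
    """One regularized step: multiplicative ascent in pi, projected descent in lam."""
    # Primal: new pi(a) is proportional to pi(a)^(1 - eta*tau) * exp(eta * (r(a) + lam * u(a))).
    logits = [(1.0 - eta * tau) * math.log(p) + eta * (ra + lam * ua)
              for p, ra, ua in zip(pi, r, u)]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(z - top) for z in logits]
    total = sum(weights)
    new_pi = [w / total for w in weights]
    # Dual: the gradient of the regularized Lagrangian in lam is slack + tau * lam.
    slack = sum(p * x for p, x in zip(pi, u)) - b
    new_lam = min(lam_max, max(0.0, lam - eta * (slack + tau * lam)))
    return new_pi, new_lam


def run(r, u, b, tau=0.1, eta=0.05, lam_max=10.0, steps=5000):
    """Iterate from the uniform policy and a zero multiplier; return the last iterate."""
    pi = [1.0 / len(r)] * len(r)
    lam = 0.0
    for _ in range(steps):
        pi, lam = primal_dual_step(pi, lam, r, u, b, tau, eta, lam_max)
    return pi, lam


if __name__ == "__main__":
    # Toy instance: action 0 pays reward 1, action 1 pays utility 1,
    # and the constraint asks for an expected utility of at least 0.5.
    pi, lam = run(r=[1.0, 0.0], u=[0.0, 1.0], b=0.5)
    print("policy:", pi, "multiplier:", lam)
```

In this toy instance the last iterate settles instead of oscillating, but with a fixed `tau` it settles at a slightly infeasible point (roughly 0.40 probability on the second action, against a threshold of 0.5). That residual bias is the kind of effect the third technique, adjusting the regularization coefficients and learning rates, is meant to remove.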

A Policy Gradient Primal-Dual Algorithm for Constrained MDPs with Uniform PAC Guarantees

Convergence Rate of Primal-Dual Approach to Constrained Reinforcement Learning with Softmax Policy

Successive Convex Approximation Based Off-Policy Optimization for Constrained Reinforcement Learning

Learning General Parameterized Policies for Infinite Horizon Average Reward Constrained MDPs via Primal-Dual Policy Gradient Algorithm

A Near-Optimal Primal-Dual Method for Off-Policy Learning in CMDP

Natural Policy Gradient Primal-Dual Method for Constrained Markov Decision Processes

Policy-based Primal-Dual Methods for Concave CMDP with Variance Reduction

Deterministic Policy Gradient Primal-Dual Methods for Continuous-Space Constrained MDPs

Accelerated Primal-Dual Policy Optimization for Safe Reinforcement Learning

Towards Painless Policy Optimization for Constrained MDPs

Achieving Zero Constraint Violation for Constrained Reinforcement Learning via Conservative Natural Policy Gradient Primal-Dual Algorithm

Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs

Truly No-Regret Learning in Constrained MDPs

A Primal-Dual-Critic Algorithm for Offline Constrained Reinforcement Learning

Policy Gradient in Robust MDPs with Global Convergence Guarantee

Adaptive Primal-Dual Method for Safe Reinforcement Learning

Optimal Strong Regret and Violation in Constrained MDPs via Policy Optimization

A safe exploration approach to constrained Markov decision processes

Sample-Efficient Constrained Reinforcement Learning with General Parameterization

CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee

Last-Iterate Convergence of General Parameterized Policies in Constrained MDPs