A Policy Gradient Primal-Dual Algorithm for Constrained MDPs with Uniform PAC Guarantees

Toshinori Kitamura, Tadashi Kozuno, Masahiro Kato, Yuki Ichihara, Soichiro Nishimori, Akiyoshi Sannai, Sho Sonoda, Wataru Kumagai, Yutaka Matsuo
2024-07-01
Abstract: We study a primal-dual (PD) reinforcement learning (RL) algorithm for online constrained Markov decision processes (CMDPs). Despite its widespread practical use, the existing theoretical literature on PD-RL algorithms for this problem only provides sublinear regret guarantees and fails to ensure convergence to optimal policies. In this paper, we introduce a novel policy gradient PD algorithm with uniform probably approximate correctness (Uniform-PAC) guarantees, simultaneously ensuring convergence to optimal policies, sublinear regret, and polynomial sample complexity for any target accuracy. Notably, this represents the first Uniform-PAC algorithm for the online CMDP problem. In addition to the theoretical guarantees, we empirically demonstrate in a simple CMDP that our algorithm converges to optimal policies, while baseline algorithms exhibit oscillatory performance and constraint violation.
Machine Learning
What problem does this paper attempt to address?
This paper addresses a gap in the theory of primal-dual (PD) reinforcement learning (RL) algorithms for online constrained Markov decision processes (CMDPs). Existing PD-RL algorithms come only with sublinear regret guarantees, which do not ensure convergence to an optimal policy. In practice these algorithms can show oscillating performance and constraint violations, which is unacceptable in applications that require stability and safety, such as autonomous driving and thermal power plant control.

### Main problems in the paper

1. **Insufficient theoretical guarantees**: Existing PD-RL algorithms provide only sublinear regret guarantees, which do not ensure that the final policy iterate performs close to an optimal policy.
2. **Lack of convergence**: Without a convergence guarantee, the iterates can oscillate and violate the constraints.
3. **Stability problems in practical applications**: Oscillation and constraint violation can compromise the stability and safety of a deployed system.

### Solutions in the paper

To address these problems, the paper proposes a new policy-gradient-based primal-dual algorithm, UOpt-RPGPD (Uniform-PAC Optimistic Regularized Policy Gradient Primal-Dual), built on three key techniques:

1. **Regularized Lagrangian function**: Regularization terms on the policy entropy and on the Lagrange multipliers are added to the Lagrangian so that the iterates approach a solution instead of oscillating around it.
2. **Uniform-PAC exploration reward**: An exploration reward is designed so that the algorithm keeps exploring efficiently even when it runs for an unbounded number of iterations.
3. **Adjusting regularization coefficients and learning rates**: The regularization coefficients and learning rates are adjusted over time to remove the bias that regularization introduces, so that the algorithm converges to an optimal policy of the original problem.

### Theoretical contributions

UOpt-RPGPD is the first algorithm with a Uniform-PAC guarantee for the online CMDP problem. The guarantee simultaneously ensures convergence to optimal policies, sublinear regret, and polynomial sample complexity for any target accuracy.

### Experimental verification

In a simple CMDP environment, UOpt-RPGPD converges to an optimal policy, while the baseline algorithms exhibit oscillatory behavior or constraint violations. This is consistent with the intended effect of the three techniques.

In conclusion, UOpt-RPGPD addresses the shortcomings of existing PD-RL algorithms in their theoretical guarantees, and the experiment in a simple CMDP suggests that it also behaves more stably than the baselines.
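
The effect of the regularized Lagrangian can be illustrated on a toy problem. The following is a minimal sketch, not the paper's UOpt-RPGPD: it assumes a hypothetical one-state CMDP with two actions, known rewards (so no exploration reward is needed), and exact gradients. The names `run_primal_dual` and `tail_swing`, the values in `REWARD`, `UTILITY`, `B`, and `LAMBDA_MAX`, and the step size are all assumptions made for illustration. With `tau = 0` the updates are a plain primal-dual scheme; with `tau > 0` an entropy term on the policy and a quadratic term on the multiplier are added.

```python
import math

# Hypothetical one-state CMDP with two actions (not taken from the paper):
# action 0 earns reward 1, action 1 earns utility 1, and the constraint
# requires expected utility >= B. The optimum plays each action half the time.
REWARD = (1.0, 0.0)
UTILITY = (0.0, 1.0)
B = 0.5
LAMBDA_MAX = 10.0  # assumed cap on the Lagrange multiplier


def run_primal_dual(tau, eta=0.1, steps=5000):
    """Return the per-step probability of action 1 for regularization strength tau."""
    theta = 0.0  # logit difference: log pi(1) - log pi(0)
    lam = 0.0    # Lagrange multiplier
    history = []
    for _ in range(steps):
        p1 = 1.0 / (1.0 + math.exp(-theta))  # pi(action 1)
        history.append(p1)
        # Advantage of action 1 over action 0 under the Lagrangian reward r + lam * u.
        gap = (REWARD[1] + lam * UTILITY[1]) - (REWARD[0] + lam * UTILITY[0])
        expected_utility = (1.0 - p1) * UTILITY[0] + p1 * UTILITY[1]
        # Entropy-regularized natural policy gradient step, written in logit space.
        new_theta = (1.0 - eta * tau) * theta + eta * gap
        # Projected gradient descent on the multiplier with penalty (tau / 2) * lam**2.
        new_lam = (1.0 - eta * tau) * lam - eta * (expected_utility - B)
        theta = new_theta
        lam = min(max(new_lam, 0.0), LAMBDA_MAX)
    return history


def tail_swing(history, window=500):
    """Spread (max - min) of the policy over the last `window` iterations."""
    tail = history[-window:]
    return max(tail) - min(tail)


unregularized = run_primal_dual(tau=0.0)
regularized = run_primal_dual(tau=0.1)
print("swing without regularization:", tail_swing(unregularized))
print("swing with regularization:   ", tail_swing(regularized))
print("constraint gap at tau=0.1:   ", B - regularized[-1])
print("constraint gap at tau=0.05:  ", B - run_primal_dual(tau=0.05)[-1])
```

In this sketch the unregularized iterates keep cycling around the optimum, while the regularized iterates settle at a point that is slightly infeasible. That residual gap shrinks as `tau` shrinks, which is the bias that the third technique targets by adjusting the regularization coefficient together with the learning rate.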