Abstract: In this paper, we consider the problem of learning safe policies for probabilistic-constrained reinforcement learning (RL). Specifically, a safe policy or controller is one that, with high probability, keeps the agent's trajectory within a given safe set. We establish a connection between this probabilistic-constrained setting and the cumulative-constrained formulation that is frequently explored in the existing literature. We provide theoretical bounds showing that the probabilistic-constrained setting offers a better trade-off between optimality and safety (constraint satisfaction). The main challenge in handling probabilistic constraints is the absence of explicit expressions for their gradients. Our prior work provides such an explicit gradient expression for probabilistic constraints, which we term Safe Policy Gradient-REINFORCE (SPG-REINFORCE). In this work, we propose an improved gradient, SPG-Actor-Critic, which has lower variance than SPG-REINFORCE, as substantiated by our theoretical results. A noteworthy property of both SPGs is that they are algorithm-independent, which makes them applicable across a range of policy-based algorithms. Furthermore, we propose a Safe Primal-Dual algorithm that can leverage both SPGs to learn safe policies, and we provide theoretical analyses covering the convergence of the algorithm as well as its near-optimality and feasibility on average. Finally, we evaluate the proposed approaches through a series of experiments. These experiments examine the inherent trade-offs between optimality and safety, and they substantiate the efficacy of both SPGs as well as our theoretical contributions.
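To make the probabilistic constraint and the primal-dual idea more concrete, the following is a minimal NumPy sketch under loudly stated assumptions. It uses a hypothetical 1-D system x_{t+1} = x_t + a_t with x_0 = 0, a Gaussian policy a_t ~ N(theta, sigma^2), the safe set [-c, c], and a toy return equal to the sum of actions. The gradient of the safety probability is estimated with the generic likelihood-ratio (score-function) identity grad P(tau in S) = E[1{tau in S} * sum_t grad log pi(a_t | s_t)]. This is a textbook REINFORCE-style estimator chosen for illustration; it is not claimed to reproduce the paper's SPG-REINFORCE or SPG-Actor-Critic expressions, and the functions `rollout`, `prob_constraint_grad`, and `primal_dual`, as well as all hyperparameters, are hypothetical.

```python
import numpy as np


def rollout(theta, sigma, horizon, safe_bound, n_traj, rng):
    """Sample trajectories of a hypothetical 1-D system x_{t+1} = x_t + a_t, x_0 = 0,
    under a Gaussian policy a_t ~ N(theta, sigma^2) (toy assumption)."""
    actions = rng.normal(theta, sigma, size=(n_traj, horizon))
    states = np.cumsum(actions, axis=1)  # x_1, ..., x_H for each trajectory
    # Probabilistic-constraint event: every visited state stays in [-safe_bound, safe_bound].
    safe = np.all(np.abs(states) <= safe_bound, axis=1)
    reward = actions.sum(axis=1)  # toy return: progress to the right
    # Trajectory score: sum_t d/dtheta log N(a_t; theta, sigma^2) = sum_t (a_t - theta) / sigma^2.
    score = ((actions - theta) / sigma**2).sum(axis=1)
    return safe, reward, score


def prob_constraint_grad(safe, score):
    """Monte Carlo estimates of P(tau in S) and of its gradient via the
    likelihood-ratio identity grad P = E[1{tau in S} * score(tau)]."""
    indicator = safe.astype(float)
    return indicator.mean(), (indicator * score).mean()


def primal_dual(theta=0.0, lam=0.0, sigma=0.5, horizon=5, safe_bound=3.0,
                delta=0.1, lr_theta=0.01, lr_lam=1.0, iters=500,
                n_traj=2000, seed=0):
    """Primal-dual sketch for: max_theta E[R]  s.t.  P(tau in S) >= 1 - delta."""
    rng = np.random.default_rng(seed)
    p_safe = 0.0
    for _ in range(iters):
        safe, reward, score = rollout(theta, sigma, horizon, safe_bound, n_traj, rng)
        p_safe, grad_p = prob_constraint_grad(safe, score)
        grad_j = (reward * score).mean()  # REINFORCE estimate of grad E[R]
        # Primal ascent on the Lagrangian E[R] + lam * (P(tau in S) - (1 - delta)).
        theta += lr_theta * (grad_j + lam * grad_p)
        # Dual descent: lam grows while the estimated safety probability is below 1 - delta.
        lam = max(0.0, lam - lr_lam * (p_safe - (1.0 - delta)))
    return theta, lam, p_safe


if __name__ == "__main__":
    theta, lam, p_safe = primal_dual()
    print(f"theta={theta:.3f}  lambda={lam:.3f}  P(safe)~{p_safe:.3f}")
```

For a horizon of 1 the estimator can be checked against the closed form P = Phi((c - theta)/sigma) - Phi((-c - theta)/sigma). Replacing the trajectory-level indicator 1{tau in S} with a sum of per-step costs would turn the same loop into a cumulative-constrained formulation, which is the setting the abstract contrasts with.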

Safe Batch Constrained Deep Reinforcement Learning with Generative Adversarial Network

Safe Reinforcement Learning Using Finite-Horizon Gradient-based Estimation

Offline Reinforcement Learning via High-Fidelity Generative Behavior Modeling

Constrained Policy Optimization with Explicit Behavior Density for Offline Reinforcement Learning

Iterative Batch Reinforcement Learning via Safe Diversified Model-based Policy Search

Robust Offline Reinforcement Learning from Low-Quality Data

Constrained Latent Action Policies for Model-Based Offline Reinforcement Learning

DiffPoGAN: Diffusion Policies with Generative Adversarial Networks for Offline Reinforcement Learning

Combined Constraint on Behavior Cloning and Discriminator in Offline Reinforcement Learning

Model-based Safe Deep Reinforcement Learning via a Constrained Proximal Policy Optimization Algorithm

Enforcing Hard Constraints with Soft Barriers: Safe Reinforcement Learning in Unknown Stochastic Environments

Provably Efficient Generalized Lagrangian Policy Optimization for Safe Multi-Agent Reinforcement Learning

Safe Reinforcement Learning via Hierarchical Adaptive Chance-Constraint Safeguards

Constraints Penalized Q-learning for Safe Offline Reinforcement Learning

Probabilistic Constraint for Safety-Critical Reinforcement Learning

Adaptive Advantage-Guided Policy Regularization for Offline Reinforcement Learning

Learning Constrained Distributions of Robot Configurations with Generative Adversarial Network

Safe Reinforcement Learning for Constrained Markov Decision Processes with Stochastic Stopping Time

Robust Safe Reinforcement Learning under Adversarial Disturbances

GenSafe: A Generalizable Safety Enhancer for Safe Reinforcement Learning Algorithms Based on Reduced Order Markov Decision Process Model

Towards Robust and Safe Reinforcement Learning with Benign Off-policy Data