Abstract: Safe reinforcement learning (RL) aims to find the optimal policy and its feasible region in a constrained optimal control problem (OCP). Ensuring feasibility and optimality simultaneously has been a major challenge. Existing methods either attempt to solve OCPs directly with constrained optimization algorithms, leading to unstable training and unsatisfactory feasibility, or restrict policies to overly small feasible regions, resulting in excessive conservativeness at the cost of optimality. To address this challenge, we propose an indirect safe RL framework called feasible policy iteration, which guarantees that the feasible region monotonically expands and converges to the maximum one, and that the state-value function monotonically improves and converges to the optimal one. We achieve this by designing a policy update principle called region-wise policy improvement, which simultaneously maximizes the state-value function under a constraint on the constraint decay function (CDF) inside the feasible region and minimizes the CDF outside the feasible region. This update scheme ensures that the state-value function monotonically increases state-wise in the feasible region and that the CDF monotonically decreases state-wise over the entire state space. We prove that the CDF converges to the solution of the risky Bellman equation while the state-value function converges to the solution of the feasible Bellman equation. The former represents the maximum feasible region and the latter gives the optimal state-value function. Experiments show that our algorithm learns strictly safe and near-optimal policies with accurate feasible regions on classic control tasks. It also achieves fewer constraint violations than baselines on Safety Gym, with better or comparable performance.
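
The region-wise update described in the abstract can be illustrated with a minimal tabular sketch. Everything below is an assumption made for illustration and is not taken from the paper: the 5-state chain, the reward, the names `step`, `reward`, `evaluate`, and `improve`, and in particular the form of the CDF backup, which is assumed here to be F(s) = c(s) + (1 - c(s)) γ F(s') with c(s) the violation indicator, so that F = 0 marks states from which the current policy never violates the constraint. Inside the region the sketch maximizes the value over actions that keep the successor in the region; outside it, it minimizes the CDF.

```python
# Hypothetical 5-state chain, invented for illustration (not from the paper).
GAMMA = 0.9
N_STATES, ACTIONS = 5, (0, 1)          # action 0 = left, action 1 = right
VIOLATION = [1, 0, 0, 0, 0]            # c(s): only state 0 violates the constraint

def step(s, a):
    """Deterministic dynamics: state 0 is absorbing, state 1 always slips into 0."""
    if s <= 1:
        return 0
    return s - 1 if a == 0 else min(s + 1, N_STATES - 1)

def reward(s, a):
    return 1.0 if a == 0 else 0.0      # moving toward the hazard pays more

def evaluate(policy, sweeps=500):
    """Fixed-point sweeps for the CDF F and the state value V of a fixed policy."""
    F = [0.0] * N_STATES
    V = [0.0] * N_STATES
    for _ in range(sweeps):
        for s in range(N_STATES):
            s2 = step(s, policy[s])
            c = VIOLATION[s]
            F[s] = c + (1 - c) * GAMMA * F[s2]            # assumed CDF backup
            V[s] = reward(s, policy[s]) + GAMMA * V[s2]   # ordinary value backup
    return F, V

def improve(policy, F, V, eps=1e-8):
    """Region-wise improvement: maximize V inside the region, minimize F outside."""
    new_policy = list(policy)
    for s in range(N_STATES):
        if F[s] <= eps:
            # Inside the feasible region: only actions whose successor stays inside.
            safe = [a for a in ACTIONS if F[step(s, a)] <= eps] or [policy[s]]
            new_policy[s] = max(
                safe, key=lambda a: reward(s, a) + GAMMA * V[step(s, a)])
        else:
            # Outside the feasible region: pick the action with the smallest CDF.
            new_policy[s] = min(ACTIONS, key=lambda a: F[step(s, a)])
    return new_policy

def feasible_policy_iteration(policy, max_iters=50, eps=1e-8):
    for _ in range(max_iters):
        F, V = evaluate(policy)
        new_policy = improve(policy, F, V, eps)
        if new_policy == policy:       # policy is stable, stop iterating
            break
        policy = new_policy
    F, V = evaluate(policy)
    region = [s for s in range(N_STATES) if F[s] <= eps]
    return policy, region, F, V

# Start from an infeasible policy (always move left), whose feasible region is empty.
policy, region, F, V = feasible_policy_iteration([0] * N_STATES)
print(policy, region)
```

In this toy problem the initial feasible region is empty, the first improvement step only shrinks the CDF, and later steps trade off reward inside the region that has been found. State 1 can never be made feasible because both of its actions lead to the violating state, so the region stops growing at states 2 to 4.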

Online Pareto optimal control of mean-field stochastic multi-player systems using policy iteration

Optimal Control for Constrained Discrete-Time Nonlinear Systems Based on Safe Reinforcement Learning

Approximate Policy Iteration for Robust Stochastic Control of Multi-agent Markov Decision Processes

Policy Iteration Reinforcement Learning Method for Continuous-Time Linear-Quadratic Mean-Field Control Problems

RL-Driven MPPI: Accelerating Online Control Laws Calculation with Offline Policy

Discrete-Time Nonzero-Sum Games for Multiplayer Using Policy-Iteration-Based Adaptive Dynamic Programming Algorithms

Data-Efficient Off-Policy Learning for Distributed Optimal Tracking Control of HMAS with Unidentified Exosystem Dynamics

Mixed Reinforcement Learning for Efficient Policy Optimization in Stochastic Environments

Online Off-Policy Reinforcement Learning for Optimal Control of Unknown Nonlinear Systems Using Neural Networks

Value Iteration-Based Cooperative Adaptive Optimal Control for Multi-Player Differential Games With Incomplete Information

A Novel Policy Iteration Algorithm for Nonlinear Continuous-Time $H_\infty$ Control Problem

Online policy iteration algorithm for semi-Markov switching state-space control processes

Feasible Policy Iteration

Linear-Quadratic Pareto Cooperative Game for Mean-Field Backward Stochastic System

Policy Iteration Based Feedback Control

Mean Field LQG Social Optimization: A Reinforcement Learning Approach

Distributed Optimal Control of Nonlinear System Based on Policy Gradient with External Disturbance

Pareto Optimal Cooperative Control of Mean-Field Backward Stochastic Differential System in Finite Horizon

Optimal Control of Robust Team Stochastic Games

Scaling policy iteration based reinforcement learning for unknown discrete-time linear systems