Abstract: Safe reinforcement learning (RL) aims to find the optimal policy and its feasible region in a constrained optimal control problem (OCP). Ensuring feasibility and optimality simultaneously has been a major challenge. Existing methods either attempt to solve OCPs directly with constrained optimization algorithms, leading to unstable training and unsatisfactory feasibility, or restrict policies to overly small feasible regions, resulting in excessive conservativeness at the cost of optimality. To address this challenge, we propose an indirect safe RL framework called feasible policy iteration, which guarantees that the feasible region monotonically expands and converges to the maximum one, and that the state-value function monotonically improves and converges to the optimal one. We achieve this by designing a policy update principle called region-wise policy improvement, which maximizes the state-value function subject to a constraint on the constraint decay function (CDF) inside the feasible region and, at the same time, minimizes the CDF outside the feasible region. This update scheme ensures that the state-value function monotonically increases state-wise in the feasible region and the CDF monotonically decreases state-wise in the entire state space. We prove that the CDF converges to the solution of the risky Bellman equation while the state-value function converges to the solution of the feasible Bellman equation. The former represents the maximum feasible region and the latter gives the optimal state-value function. Experiments show that our algorithm learns strictly safe and near-optimal policies with accurate feasible regions on classic control tasks. It also achieves fewer constraint violations with performance better than (or comparable to) baselines on Safety Gym.
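
The region-wise update described in the abstract can be illustrated on a toy problem. The following is a minimal tabular sketch, not the paper's implementation. Everything specific in it is an assumption made for illustration: the six-state chain, the reward r(s) = s, the CDF recursion F(s) = c(s) + (1 - c(s)) * gamma * F(s'), the feasibility test F(s) = 0, and the inside-region constraint F(s') <= F(s). The abstract does not spell out any of these.

```python
import numpy as np

# Hypothetical toy chain: states 0..5, state 5 violates the constraint,
# and state 4 drifts into state 5 no matter which action is taken.
N_STATES, GAMMA, TOL = 6, 0.9, 1e-9
ACTIONS = (-1, 0, 1)                                   # left, stay, right
violation = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 1.0])   # c(s)
reward = np.arange(N_STATES, dtype=float)              # assumed reward r(s) = s

def step(s, a):
    if s >= 4:                                         # uncontrollable drift
        return min(s + 1, N_STATES - 1)
    return int(np.clip(s + a, 0, N_STATES - 1))

# NEXT[s, i] is the successor of state s under action index i.
NEXT = np.array([[step(s, a) for a in ACTIONS] for s in range(N_STATES)])

def evaluate(policy, sweeps=500):
    """Fixed-point iteration for the assumed CDF and the state-value function."""
    nxt = NEXT[np.arange(N_STATES), policy]
    cdf = np.zeros(N_STATES)
    value = np.zeros(N_STATES)
    for _ in range(sweeps):
        cdf = violation + (1.0 - violation) * GAMMA * cdf[nxt]
        value = reward + GAMMA * value[nxt]
    return cdf, value

def improve(policy, cdf, value):
    """Region-wise improvement: constrained value maximization inside the
    feasible region, CDF minimization outside it."""
    new_policy = policy.copy()
    feasible = cdf <= TOL
    for s in range(N_STATES):
        nxt = NEXT[s]
        if feasible[s]:
            q = reward[s] + GAMMA * value[nxt]
            q[cdf[nxt] > cdf[s] + TOL] = -np.inf       # forbid raising the CDF
            new_policy[s] = int(np.argmax(q))
        else:
            new_policy[s] = int(np.argmin(cdf[nxt]))
    return new_policy

def feasible_policy_iteration(policy, iterations=5):
    regions = []                                       # feasible region per iteration
    for _ in range(iterations):
        cdf, value = evaluate(policy)
        regions.append(set(np.flatnonzero(cdf <= TOL).tolist()))
        policy = improve(policy, cdf, value)
    return policy, regions

# Initial policy (action indices): move left in states 0 and 1, right elsewhere.
initial = np.array([0, 0, 2, 2, 2, 2])
policy, regions = feasible_policy_iteration(initial)
print(regions)
print(policy)
```

In this toy setting the recorded feasible regions never shrink from one iteration to the next, and the final policy moves right until state 3 and then stays there, which is the highest-reward behavior that never reaches the violating state.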

Policy Iteration Reinforcement Learning Method for Continuous-Time Linear-Quadratic Mean-Field Control Problems

Optimal Control for Constrained Discrete-Time Nonlinear Systems Based on Safe Reinforcement Learning

Robust policy iteration for continuous-time stochastic $H_\infty$ control problem with unknown dynamics

Reinforcement Learning for a Discrete-Time Linear-Quadratic Control Problem with an Application

A Reinforcement Learning Method for LQR Control Problem

Data-driven policy iteration algorithm for continuous-time stochastic linear-quadratic optimal control problems

Full error analysis of policy gradient learning algorithms for exploratory linear quadratic mean-field control problem in continuous time with common noise

Sublinear Regret for a Class of Continuous-Time Linear-Quadratic Reinforcement Learning Problems

Reinforcement Learning for Finite-Horizon H∞ Tracking Control of Unknown Discrete Linear Time-Varying System

Reinforcement Learning Policies in Continuous-Time Linear Systems

Continuous-time q-learning for mean-field control problems

Feasible Policy Iteration

A Novel Policy Iteration Algorithm for Nonlinear Continuous-Time $H_\infty$ Control Problem

Deep Reinforcement Learning for Infinite Horizon Mean Field Problems in Continuous Spaces

Fast Policy Learning for Linear Quadratic Control with Entropy Regularization

Reinforcement Learning-Based Control for Nonlinear Discrete-Time Systems with Unknown Control Directions and Control Constraints

Two‐loop reinforcement learning algorithm for finite‐horizon optimal control of continuous‐time affine nonlinear systems

Unified Reinforcement Q-Learning for Mean Field Game and Control Problems

RL-Driven MPPI: Accelerating Online Control Laws Calculation with Offline Policy

A policy iteration algorithm for non-Markovian control problems

Unified continuous-time q-learning for mean-field game and mean-field control problems