Abstract:Safe reinforcement learning (RL) aims to find the optimal policy and its feasible region in a constrained optimal control problem (OCP). Ensuring feasibility and optimality simultaneously has been a major challenge. Existing methods either attempt to solve OCPs directly with constrained optimization algorithms, leading to unstable training processes and unsatisfactory feasibility, or restrict policies in overly small feasible regions, resulting in excessive conservativeness with sacrificed optimality. To address this challenge, we propose an indirect safe RL framework called feasible policy iteration, which guarantees that the feasible region monotonically expands and converges to the maximum one, and the state-value function monotonically improves and converges to the optimal one. We achieve this by designing a policy update principle called region-wise policy improvement, which maximizes the state-value function under the constraint of the constraint decay function (CDF) inside the feasible region and minimizes the CDF outside the feasible region simultaneously. This update scheme ensures that the state-value function monotonically increases state-wise in the feasible region and the CDF monotonically decreases state-wise in the entire state space. We prove that the CDF converges to the solution of the risky Bellman equation while the state-value function converges to the solution of the feasible Bellman equation. The former represents the maximum feasible region and the latter manifests the optimal state-value function. Experiments show that our algorithm learns strictly safe and near-optimal policies with accurate feasible regions on classic control tasks. It also achieves fewer constraint violations with performance better than (or comparable to) baselines on Safety Gym.

Balancing Value Iteration and Policy Iteration for Discrete-Time Control.

Optimal Control for Constrained Discrete-Time Nonlinear Systems Based on Safe Reinforcement Learning.

Discrete-Time Nonlinear Optimal Control Using Multi-Step Reinforcement Learning

Generalized Policy Iteration-based Reinforcement Learning Algorithm for Optimal Control of Unknown Discrete-time Systems

Reinforcement Learning-Based Control for Nonlinear Discrete-Time Systems with Unknown Control Directions and Control Constraints

NN Reinforcement Learning Adaptive Control for a Class of Nonstrict-Feedback Discrete-Time Systems

Policy Iteration Reinforcement Learning Method for Continuous-Time Linear-Quadratic Mean-Field Control Problems

Integrating Classical Control into Reinforcement Learning Policy

Online Reinforcement Learning-based Neural Network Controller Design for Affine Nonlinear Discrete-time Systems.

Learning-based adaptive optimal control of linear time-delay systems: A value iteration approach

Value Iteration Based Integral Reinforcement Learning Approach for H∞ Controller Design of Continuous-Time Nonlinear Systems

Online Off-Policy Reinforcement Learning for Optimal Control of Unknown Nonlinear Systems Using Neural Networks

Feasible Policy Iteration

Adaptive Optimal Control of Discrete-Time Linear Systems with Discounted Value: Off-Policy Reinforcement Learning

Value Iteration-Based H∞ Controller Design for Continuous-Time Nonlinear Systems Subject to Input Constraints

Adaptive Optimal Robust Control for Uncertain Nonlinear Systems Using Neural Network Approximation in Policy Iteration

A Novel Policy Iteration Algorithm for Nonlinear Continuous-Time H$\infty$ Control Problem

Adaptive Optimal Control for a Class of Nonlinear Systems: the Online Policy Iteration Approach.

Data-based Approximate Policy Iteration for Affine Nonlinear Continuous-Time Optimal Control Design.

Policy Gradient Reinforcement Learning for Parameterized Continuous-Time Optimal Control

Discrete-Time Optimal Control Via Local Policy Iteration Adaptive Dynamic Programming