Abstract: Reinforcement Learning (RL) is a powerful method for controlling dynamic systems, but its learning mechanism can lead to unpredictable actions that undermine the safety of critical systems. Here, we propose RL with Adaptive Regularization (RL-AR), an algorithm that enables safe RL exploration by combining the RL policy with a policy regularizer that hard-codes the safety constraints. RL-AR performs policy combination via a "focus module," which determines the appropriate combination depending on the state, relying more on the safe policy regularizer for less-exploited states while allowing unbiased convergence for well-exploited states. In a series of critical control applications, we demonstrate that RL-AR not only ensures safety during training but also achieves a return competitive with the standards of model-free RL that disregards safety.
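The state-dependent policy combination described in the abstract can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the convex-combination form, the names `combine_actions` and `CountBasedFocus`, and in particular the visit-count rule used to set the weight. The abstract does not say how the focus module computes its output, so this is a hypothetical stand-in rather than the paper's mechanism.

```python
import math


def combine_actions(a_reg, a_rl, beta):
    """Blend two action vectors: beta * a_reg + (1 - beta) * a_rl.

    beta = 1 follows the safe regularizer only; beta = 0 follows the RL
    policy only. (Assumed form, not taken from the paper.)
    """
    beta = min(max(beta, 0.0), 1.0)  # keep the weight inside [0, 1]
    return [beta * x + (1.0 - beta) * y for x, y in zip(a_reg, a_rl)]


class CountBasedFocus:
    """Hypothetical focus module: the weight on the safe regularizer decays
    with how often a (discretized) state has been visited."""

    def __init__(self, bin_width=0.5, decay=0.1):
        self.bin_width = bin_width  # size of each state-space bin
        self.decay = decay          # how quickly trust shifts to the RL policy
        self.counts = {}            # visit count per discretized state

    def _key(self, state):
        # Map a continuous state vector to the bin it falls into.
        return tuple(math.floor(x / self.bin_width) for x in state)

    def weight(self, state):
        # Unvisited states get weight 1.0 (rely fully on the safe regularizer).
        n = self.counts.get(self._key(state), 0)
        return math.exp(-self.decay * n)

    def update(self, state):
        # Record one more visit to the bin containing this state.
        key = self._key(state)
        self.counts[key] = self.counts.get(key, 0) + 1


if __name__ == "__main__":
    focus = CountBasedFocus()
    state = [0.1, 0.2]
    a_reg, a_rl = [0.0], [1.0]  # toy one-dimensional actions
    print("unvisited state:", combine_actions(a_reg, a_rl, focus.weight(state)))
    for _ in range(50):
        focus.update(state)
    print("after 50 visits:", combine_actions(a_reg, a_rl, focus.weight(state)))
```

In this toy version the combined action starts at the regularizer's action for an unvisited state and drifts toward the RL action as the state is visited more often, mirroring the qualitative behavior the abstract describes.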

What problem does this paper attempt to address?

The key problem this paper addresses is how to keep the learning process safe when Reinforcement Learning (RL) is applied to critical systems. Specifically, the paper proposes a new algorithm, Reinforcement Learning with Adaptive Regularization (RL-AR), to address the following issues:

1. **Safety of Critical Systems**:
   - In critical systems in fields such as medicine and engineering, control actions must not damage the functions of the system. In scenarios such as nuclear fusion management, robotic surgery, and patient treatment strategies, any unsafe operation may lead to serious consequences.
   - Traditional RL methods rely on trial and error when exploring for the optimal policy, which may violate the safety constraints of critical systems.

2. **Limitations of Existing Safe RL Algorithms**:
   - Existing safe RL algorithms either cannot ensure safety during the training phase or incur a large computational overhead for action verification.
   - Classical control methods are reliable, but their performance depends heavily on the accuracy of the environment model, and an accurate model is often difficult to obtain in practice.

3. **Requirements for Single-Life Applications**:
   - In "single-life" application scenarios, the control system must avoid unsafe operations from the first attempt. For example, when learning a control strategy to regulate a patient's health status, no harm to the patient during policy exploration can be tolerated.

### Solutions Proposed in the Paper

To solve the above problems, the paper proposes the RL-AR algorithm, which has the following main features:

- **Combining a Safe Policy Regularizer with an Adaptive RL Policy**:
  - RL-AR combines the safety regularizer with the adaptive RL agent by introducing a "focus module". The focus module determines how the two policies are combined according to the current state.
- **Ensuring Safety in the Initial Stage**:
  - In the early stage of training, the focus module gives priority to the safety regularizer to keep the system in a safe state. As the agent's understanding of the environment grows, the focus module gradually increases the weight of the adaptive RL policy, thereby achieving better control performance.
- **Adaptability and Convergence**:
  - RL-AR not only ensures safety during training but also achieves, in the later stage of training, a return comparable to that of traditional RL methods without safety constraints. The algorithm also gradually converges to the optimal RL policy.

### Summary of Mathematical Formulas

1. **Definition of Markov Decision Process (MDP)**:
   \[ M=(S, A, P, r, \gamma) \]
   where \(S\) is the finite state set, \(A\) is the action space, \(P: S\times A\rightarrow P(S)\) is the state transition function, \(r: S\times A\rightarrow[-R_{\text{max}}, R_{\text{max}}]\) is the reward function, and \(\gamma\in(0, 1)\) is the discount factor.

2. **Value Function and Action-Value Function**:
   \[ V^{\pi}(s_t)=\mathbb{E}_{a_t, s_{t + 1},\ldots}\left[\sum_{i = 0}^{\infty}\gamma^i r(s_{t + i}, a_{t + i})\right] \]
   \[ Q^{\pi}(s_t, a_t)=\mathbb{E}_{s_{t+1},\ldots}\left[\sum_{i = 0}^{\infty}\gamma^i r(s_{t + i}, a_{t + i})\right] \]

3. **Bellman Equation**:
   \[ V^{\pi}(s)=\mathbb{E}_{a\sim\pi(\cdot\mid s),\; s'\sim P(\cdot\mid s, a)}\left[r(s, a)+\gamma V^{\pi}(s')\right] \]
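To make the value-function definition and its Bellman form concrete, here is a minimal sketch of iterative policy evaluation on a tabular MDP. The function names, the nested-list layout for \(P\), \(r\), and \(\pi\), and the restriction to a finite action set are all assumptions chosen for illustration; none of them come from the paper.

```python
def bellman_backup(P, r, pi, gamma, V, s):
    """One Bellman expectation backup for state s:
    sum_a pi[s][a] * (r[s][a] + gamma * sum_s2 P[s][a][s2] * V[s2]).
    """
    total = 0.0
    for a, p_a in enumerate(pi[s]):
        expected_next = sum(P[s][a][s2] * V[s2] for s2 in range(len(V)))
        total += p_a * (r[s][a] + gamma * expected_next)
    return total


def evaluate_policy(P, r, pi, gamma, tol=1e-10, max_iters=10_000):
    """Apply the backup to every state until V stops changing."""
    V = [0.0] * len(P)
    for _ in range(max_iters):
        V_new = [bellman_backup(P, r, pi, gamma, V, s) for s in range(len(P))]
        if max(abs(x - y) for x, y in zip(V, V_new)) < tol:
            return V_new
        V = V_new
    return V


def discounted_return(rewards, gamma):
    """Finite-horizon version of the sum inside the expectation."""
    return sum((gamma ** i) * rew for i, rew in enumerate(rewards))


if __name__ == "__main__":
    # Toy 2-state, 2-action MDP: action 0 leads to state 0, action 1 to state 1.
    P = [[[1.0, 0.0], [0.0, 1.0]],
         [[1.0, 0.0], [0.0, 1.0]]]
    r = [[0.0, 0.0],   # rewards available in state 0
         [0.0, 1.0]]   # staying in state 1 (action 1) pays 1
    pi = [[0.0, 1.0],  # the policy always picks action 1
          [0.0, 1.0]]
    V = evaluate_policy(P, r, pi, gamma=0.9)
    print("V =", V)
    # At convergence V should be (approximately) a fixed point of the backup.
    print("residual at state 0:", abs(V[0] - bellman_backup(P, r, pi, 0.9, V, 0)))
```

For this toy problem the closed-form values are \(V^{\pi}(1)=1/(1-\gamma)=10\) and \(V^{\pi}(0)=\gamma V^{\pi}(1)=9\) with \(\gamma=0.9\), which gives a quick check on the iteration.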

Reinforcement Learning with Adaptive Regularization for Safe Control of Critical Systems

Cautious Adaptation For Reinforcement Learning in Safety-Critical Settings

Safe Reinforcement Learning Using Robust Control Barrier Functions

Safe Reinforcement Learning via Hierarchical Adaptive Chance-Constraint Safeguards

Model-Based Safe Reinforcement Learning with Time-Varying State and Control Constraints: An Application to Intelligent Vehicles

Model-Based Safe Reinforcement Learning With Time-Varying Constraints: Applications to Intelligent Vehicles

End-to-End Safe Reinforcement Learning through Barrier Functions for Safety-Critical Continuous Control Tasks

Concurrent Learning of Policy and Unknown Safety Constraints in Reinforcement Learning

Progressive Adaptive Chance-Constrained Safeguards for Reinforcement Learning

Safe reinforcement learning for probabilistic reachability and safety specifications: A Lyapunov-based approach

Safe Reinforcement Learning with Dual Robustness

Evaluating Model-free Reinforcement Learning Toward Safety-critical Tasks

Safe Deep Policy Adaptation

Learning to be Safe: Deep RL with a Safety Critic

Model-Free Safe Reinforcement Learning Through Neural Barrier Certificate

ActSafe: Active Exploration with Safety Constraints for Reinforcement Learning

Improving the Robustness of Reinforcement Learning Policies with $\mathcal{L}_{1}$ Adaptive Control

Learning Adaptive Safety for Multi-Agent Systems

Safe Model-Based Reinforcement Learning with an Uncertainty-Aware Reachability Certificate

Safety reinforcement learning control via transfer learning

Long and Short-Term Constraints Driven Safe Reinforcement Learning for Autonomous Driving