Abstract: Reinforcement Learning (RL) is a powerful method for controlling dynamic systems, but its learning mechanism can lead to unpredictable actions that undermine the safety of critical systems. Here, we propose RL with Adaptive Regularization (RL-AR), an algorithm that enables safe RL exploration by combining the RL policy with a policy regularizer that hard-codes the safety constraints. RL-AR performs policy combination via a "focus module," which determines the appropriate combination depending on the state, relying more on the safe policy regularizer for less-exploited states while allowing unbiased convergence for well-exploited states. In a series of critical control applications, we demonstrate that RL-AR not only ensures safety during training but also achieves a return competitive with the standards of model-free RL that disregards safety.
### What problem does this paper attempt to address?
The paper asks how to keep the learning process safe when Reinforcement Learning (RL) is applied to critical systems. It proposes a new algorithm, Reinforcement Learning with Adaptive Regularization (RL-AR), to address the following issues:
1. **Safety of Critical Systems**:
- In critical systems in fields such as medicine and engineering, control actions must not compromise the system's function. In scenarios such as nuclear fusion management, robotic surgery, and patient treatment strategies, any unsafe operation may lead to serious consequences.
- Traditional RL methods explore by trial and error while searching for the optimal policy, so they may violate the safety constraints of critical systems.
2. **Limitations of Existing Safe RL Algorithms**:
- Existing safe RL algorithms either cannot ensure safety during the training phase or incur substantial computational overhead for action verification.
- Classical control methods are reliable, but their performance depends heavily on the accuracy of the environment model, and an accurate model is often difficult to obtain in practice.
3. **Requirements for Single-Life Applications**:
- In "single-life" application scenarios, the control system must avoid unsafe operations from the first attempt. For example, when formulating a control strategy to regulate a patient's health status, any harm to the patient during the policy exploration process cannot be tolerated.
### Solutions Proposed in the Paper
To solve the above problems, the paper proposes the RL-AR algorithm, which has the following main features:
- **Combining Safety Regularization Policy with Adaptive RL Policy**:
- RL-AR combines the safety regularizer with the adaptive RL agent through a "focus module". The focus module determines how the two policies are combined based on the current state.
- **Ensuring Safety in the Initial Stage**:
- Early in training, when most states are little explored, the focus module relies mainly on the safety regularizer to keep the system in a safe state. As experience with the environment accumulates, it gradually shifts weight toward the adaptive RL policy, thereby achieving better control performance.
- **Adaptability and Convergence**:
- RL-AR ensures safety during training and, in the later stages of training, achieves a return comparable to that of traditional RL methods that ignore safety constraints. The algorithm can also gradually converge to the optimal RL policy.
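
The state-dependent combination described above can be sketched as follows. This is only an illustration under stated assumptions: the summary does not give the combination rule, so the sketch assumes a convex combination of the two actions with a weight `beta` in [0, 1]. The names `combine_actions` and `count_based_focus` are hypothetical, and the visit-count rule is a simple stand-in for the focus module rather than the paper's mechanism.

```python
import numpy as np


def combine_actions(a_reg, a_rl, beta):
    """Blend the regularizer action and the RL action.

    Assumed form (not taken from the source): a convex combination,
    where beta = 1 follows the safe regularizer only and beta = 0
    follows the RL policy only.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    a_reg = np.asarray(a_reg, dtype=float)
    a_rl = np.asarray(a_rl, dtype=float)
    return beta * a_reg + (1.0 - beta) * a_rl


def count_based_focus(visit_count, scale=10.0):
    """Hypothetical stand-in for the focus module.

    Returns a weight that starts at 1 for an unvisited state and
    decays toward 0 as the state is visited more often.
    """
    return scale / (scale + visit_count)


# A rarely visited state leans on the regularizer; a well-visited one on RL.
a_reg, a_rl = [0.2], [1.0]
early = combine_actions(a_reg, a_rl, count_based_focus(visit_count=0))
late = combine_actions(a_reg, a_rl, count_based_focus(visit_count=990))
```

With these assumptions, `early` equals the regularizer action exactly, while `late` sits close to the RL action, mirroring the shift from the safe policy to the learned one as a state becomes well explored.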
### Summary of Mathematical Formulas
1. **Definition of Markov Decision Process (MDP)**:
\[
M=(S, A, P, r, \gamma)
\]
where \(S\) is the finite state set, \(A\) is the action space, \(P: S\times A\rightarrow P(S)\) is the state transition function (with \(P(S)\) denoting the set of probability distributions over \(S\)), \(r: S\times A\rightarrow[-R_{\text{max}}, R_{\text{max}}]\) is the reward function, and \(\gamma\in(0, 1)\) is the discount factor.
2. **Value Function and Action-Value Function**:
\[
V^{\pi}(s_t)=\mathbb{E}_{a_t, s_{t + 1},\ldots}\left[\sum_{i = 0}^{\infty}\gamma^i r(s_{t + i}, a_{t + i})\right]
\]
\[
Q^{\pi}(s_t, a_t)=\mathbb{E}_{s_{t+1},\ldots}\left[\sum_{i = 0}^{\infty}\gamma^i r(s_{t + i}, a_{t + i})\right]
\]
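
To make the discounted sum inside both expectations concrete, here is a minimal sketch that evaluates \(\sum_i \gamma^i r_i\) for one finite reward sequence. The function name `discounted_return` and the sample rewards are illustrative assumptions. Truncating to a finite sequence only approximates the infinite sum in the definitions, and averaging such returns over many sampled trajectories would give a Monte Carlo estimate of \(V^{\pi}\).

```python
def discounted_return(rewards, gamma):
    """Compute sum_i gamma**i * rewards[i] for a finite reward sequence."""
    total = 0.0
    # Accumulate backwards so each step applies one more factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three steps of reward 1 with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

Because every reward lies in \([-R_{\text{max}}, R_{\text{max}}]\) and \(\gamma<1\), the infinite sum is bounded in magnitude by \(R_{\text{max}}/(1-\gamma)\).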
3. **Bellman Equation**:
\[
V^{\pi}(s)=\mathbb{E}