Abstract: Safe and optimal controller synthesis for switched-controlled hybrid systems, which combine differential equations and discrete changes of the system's state, is known to be intricately hard. Reinforcement learning has been leveraged to construct near-optimal controllers, but their behavior is not guaranteed to be safe, even when it is encouraged by reward engineering. One way of imposing safety on a learned controller is to use a shield, which is correct by design. However, obtaining a shield for non-linear and hybrid environments is itself intractable. In this paper, we propose the construction of a shield using the so-called barbaric method, where an approximate finite representation of an underlying partition-based two-player safety game is extracted via systematically picked samples of the true transition function. While hard safety guarantees are out of reach, we experimentally demonstrate strong statistical safety guarantees with a prototype implementation and UPPAAL STRATEGO. Furthermore, we study the impact of the synthesized shield when applied as either a pre-shield (applied before learning a controller) or a post-shield (only applied after learning a controller). We experimentally demonstrate superiority of the pre-shielding approach. We apply our technique to a range of case studies, including two industrial examples, and further study post-optimization of the post-shielding approach.

What problem does this paper attempt to address?

The paper addresses the problem of synthesizing safe and near-optimal controllers for switched-controlled hybrid systems, which combine continuous dynamics governed by differential equations with discrete changes of the system's state. Specifically, it studies how reinforcement learning (RL) can be used to construct controllers for such systems while keeping their behavior safe.

Controller design for hybrid systems is hard. RL can produce near-optimal controllers, but their behavior is not guaranteed to be safe, especially in worst-case scenarios. Reward engineering can encourage safe behavior, but it does not rule out unsafe situations and may reduce the overall performance of the controller.

To address this, the paper constructs a shield using the so-called "barbaric method". A shield is a mechanism, correct by design, that restricts the controller to safe actions. Because obtaining an exact shield for non-linear and hybrid environments is itself intractable, the method partitions the state space into finitely many cells and uses systematically picked samples of the true transition function to extract an approximate finite representation of the underlying two-player safety game, from which the shield strategy is obtained. Hard safety guarantees are out of reach with this approximation, but the experiments, run with a prototype implementation and UPPAAL STRATEGO, demonstrate strong statistical safety guarantees.

The paper also compares two ways of applying the shield. A pre-shield is applied during learning, so the agent can only choose among safe actions; a post-shield is applied only at deployment, where it monitors and corrects an already trained agent. In the experiments, pre-shielding outperforms post-shielding.
Additionally, the paper evaluates the shield synthesis technique on a range of case studies, including two industrial examples, and further studies post-optimization of the post-shielding approach. In summary, the paper offers a practical approach to the safe control of hybrid systems with complex, non-linear dynamics, backed by statistical rather than hard safety guarantees.
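
The partition-and-sample idea summarized above can be illustrated with a minimal sketch. It is not the paper's implementation: the 1-D system, its `step` function, the cell size, the sampled disturbances, and all function names are hypothetical choices made only for illustration. The sketch partitions the state space into cells, simulates the transition function from a few systematically placed points per cell, solves the resulting finite safety game as a greatest fixed point, and then exposes the result both as a pre-shield (the set of allowed actions) and as a post-shield (an override of a proposed action).

```python
import itertools

# Hypothetical toy system (not from the paper): 1-D state x, safe while X_MIN <= x < X_MAX.
X_MIN, X_MAX, CELL = 0.0, 10.0, 0.5
N_CELLS = int((X_MAX - X_MIN) / CELL)
ACTIONS = (-1.0, 0.0, 1.0)        # moves of the controller
DISTURBANCES = (-0.4, 0.0, 0.4)   # sampled moves of the adversarial environment
SAMPLES_PER_CELL = 4              # systematically placed start points per cell


def step(x, action, disturbance):
    """Stand-in for the true transition function over one control period."""
    return x + action + disturbance


def cell_of(x):
    """Index of the partition cell containing x, or None if x is unsafe."""
    if not (X_MIN <= x < X_MAX):
        return None
    return int((x - X_MIN) // CELL)


def sample_points(cell):
    """Evenly spaced start points spanning the cell, both edges included."""
    lo = X_MIN + cell * CELL
    return [lo + CELL * k / (SAMPLES_PER_CELL - 1) for k in range(SAMPLES_PER_CELL)]


def approximate_successors():
    """Map (cell, action) to the set of cells reached from the sampled points."""
    reach = {}
    for cell, action in itertools.product(range(N_CELLS), ACTIONS):
        reach[cell, action] = {
            cell_of(step(x, action, d))
            for x in sample_points(cell)
            for d in DISTURBANCES
        }
    return reach


def synthesize_shield(reach):
    """Greatest fixed point of the safety game: drop cells with no safe action."""
    safe = set(range(N_CELLS))
    while True:
        # An action is allowed if every sampled successor is a safe cell
        # (None, the unsafe marker, is never in `safe`).
        allowed = {c: {a for a in ACTIONS if reach[c, a] <= safe} for c in safe}
        losing = {c for c, acts in allowed.items() if not acts}
        if not losing:
            return allowed
        safe -= losing


SHIELD = synthesize_shield(approximate_successors())


def allowed_actions(x):
    """Pre-shield view: the actions a learner may choose from in state x."""
    return SHIELD.get(cell_of(x), set())


def post_shield(x, proposed):
    """Post-shield view: keep a trained agent's action if allowed, else override it."""
    allowed = allowed_actions(x)
    if proposed in allowed or not allowed:
        return proposed
    return min(allowed, key=lambda a: abs(a - proposed))
```

Because the successor sets come from finitely many samples rather than from an exact reachability computation, a shield built this way can miss transitions. This is why only statistical, not hard, safety guarantees can be expected from such an approximation.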

Shielded Reinforcement Learning for Hybrid Systems
