SafeDreamer: Safe Reinforcement Learning with World Models

Weidong Huang, Jiaming Ji, Chunhe Xia, Borong Zhang, Yaodong Yang
2024-08-08
Abstract: The deployment of Reinforcement Learning (RL) in real-world applications is constrained by its failure to satisfy safety criteria. Existing Safe Reinforcement Learning (SafeRL) methods, which rely on cost functions to enforce safety, often fail to achieve zero-cost performance in complex scenarios, especially vision-only tasks. These limitations are primarily due to model inaccuracies and inadequate sample efficiency. The integration of the world model has proven effective in mitigating these shortcomings. In this work, we introduce SafeDreamer, a novel algorithm incorporating Lagrangian-based methods into world model planning processes within the superior Dreamer framework. Our method achieves nearly zero-cost performance on various tasks, spanning low-dimensional and vision-only input, within the Safety-Gymnasium benchmark, showcasing its efficacy in balancing performance and safety in RL tasks. Further details can be found in the code repository: https://github.com/PKU-Alignment/SafeDreamer.
Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the difficulty reinforcement learning (RL) agents have in meeting safety standards in real-world applications. Existing safe reinforcement learning (SafeRL) methods rely on cost functions to enforce safety, but in complex scenarios, especially vision-only tasks, they often fail to achieve zero-cost performance. The paper attributes these limitations mainly to inaccurate models and insufficient sample efficiency. To this end, it proposes SafeDreamer, a new algorithm that balances performance and safety by integrating the Lagrangian method into the world-model planning process of the Dreamer framework. SafeDreamer achieves near-zero-cost performance on a range of tasks in the Safety-Gymnasium benchmark, covering both low-dimensional and vision-only inputs. Specifically, the paper focuses on the following key issues:

1. **How to maximize rewards while ensuring safety**: The paper proposes an Online Safety-Reward Planning (OSRP) algorithm and proves the feasibility of using online planning in the world model to meet constraints. In particular, it adopts the Constrained Cross-Entropy Method for the planning process in vision-only tasks.
2. **How to balance long-term rewards and costs**: The paper combines the Lagrangian method with online and background safety-reward planning in the world model and proposes two algorithms, OSRP-Lag and BSRP-Lag. They handle the trade-off between safety and reward in online planning and in background planning, respectively.
3. **How to handle low-dimensional and visual-input tasks**: SafeDreamer handles both low-dimensional and visual inputs, achieving near-zero-cost performance in the Safety-Gymnasium benchmark and outperforming existing model-based methods in multiple environments.
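The constrained planning idea mentioned above can be illustrated with a minimal sketch of a constrained cross-entropy method. This is a toy, not the paper's implementation: it plans over a single scalar action rather than action sequences imagined by a world model, and the names `constrained_cem`, `reward_fn`, `cost_fn`, and `cost_budget`, along with all hyperparameters, are assumptions made for illustration. The elite-selection rule shown (prefer low cost when too few candidates are feasible, otherwise prefer high reward among feasible candidates) is the usual constrained CEM recipe.

```python
import random
import statistics


def constrained_cem(reward_fn, cost_fn, cost_budget,
                    iters=20, pop=64, n_elite=8, seed=0):
    """Toy constrained CEM over a scalar action (hypothetical helper)."""
    rng = random.Random(seed)
    mean, std = 0.0, 2.0  # initial Gaussian sampling distribution
    for _ in range(iters):
        samples = [rng.gauss(mean, std) for _ in range(pop)]
        feasible = [a for a in samples if cost_fn(a) <= cost_budget]
        if len(feasible) < n_elite:
            # Too few safe candidates: move toward the lowest-cost samples.
            elites = sorted(samples, key=cost_fn)[:n_elite]
        else:
            # Enough safe candidates: keep the highest-reward safe ones.
            elites = sorted(feasible, key=reward_fn, reverse=True)[:n_elite]
        # Refit the sampling distribution to the elites.
        mean = statistics.fmean(elites)
        std = statistics.pstdev(elites) + 1e-3  # floor avoids total collapse
    return mean


# Toy problem (assumed): reward grows with the action, cost appears above 1.0.
best = constrained_cem(
    reward_fn=lambda a: a,
    cost_fn=lambda a: max(0.0, a - 1.0),
    cost_budget=0.0,
)
print(f"planned action: {best:.3f}")
```

In this toy setup the planner should settle just below the constraint boundary at 1.0, which is the highest-reward action that still incurs zero cost. In a world-model setting, `reward_fn` and `cost_fn` would instead score imagined rollouts.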
In summary, the paper introduces the SafeDreamer algorithm to address the difficulty existing SafeRL methods have in achieving zero-cost performance in complex scenarios, especially vision-only tasks. By combining the Lagrangian method with a world model, SafeDreamer aims to achieve higher reward while still satisfying the safety constraints.
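As a rough illustration of how a Lagrangian method trades reward against cost, the sketch below shows a plain projected dual-ascent update of a multiplier. It is a generic textbook form under stated assumptions, not necessarily the variant used in SafeDreamer; the function names, the cost budget of 2.0, the learning rate, and the sequence of episode costs are all hypothetical.

```python
def penalized_objective(reward_return, cost_return, lam, cost_budget):
    """Lagrangian-relaxed objective: reward minus weighted constraint violation."""
    return reward_return - lam * (cost_return - cost_budget)


def update_multiplier(lam, cost_return, cost_budget, lr=0.1):
    """Projected dual ascent: raise lam when cost exceeds the budget,
    lower it otherwise, and never let it fall below zero."""
    return max(0.0, lam + lr * (cost_return - cost_budget))


# Hypothetical episode costs that fall as the policy becomes safer.
lam = 0.0
history = []
for cost_return in [5.0, 4.0, 3.0, 1.0, 0.0]:
    lam = update_multiplier(lam, cost_return, cost_budget=2.0)
    history.append(lam)

print([round(x, 3) for x in history])
```

The multiplier grows while the observed cost exceeds the budget and shrinks once the cost drops below it, so the penalty on unsafe behavior adapts to how well the constraint is currently being met.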