Cost-aware Offline Safe Meta Reinforcement Learning with Robust In-Distribution Online Task Adaptation.
Cong Guan, Ruiqi Xue, Ziqian Zhang, Lihe Li, Yi-Chen Li, Lei Yuan, Yang Yu
DOI: https://doi.org/10.5555/3635637.3662927
2024-01-01
Abstract: Despite the prominence reinforcement learning (RL) has gained in various domains, ensuring safety in real-world applications remains a significant challenge. Offline safe RL, which learns safe policies from pre-collected data, has emerged to address this concern. However, existing approaches assume a single constraint mode and lack adaptability to diverse safety constraints. In real-world scenarios, datasets are often gathered from various tasks, and the aim is to develop a generalized policy capable of handling unknown tasks during testing. To deal with this offline safe meta RL problem, we introduce a novel framework called COSTA, which is designed to learn a safe generalized policy that can adapt and transfer to unknown tasks during testing. COSTA addresses two key challenges in offline safe meta RL. First, it develops a cost-aware task inference module using contrastive learning to distinguish tasks based on safety constraints, mitigating the MDP ambiguity problem. Second, COSTA introduces a novel metric, the Safe In-Distribution Score (SIDS), to assess the in-distribution degree of trajectories, taking into account both reward maximization and cost constraint satisfaction. Trajectories collected with a safe exploration policy are filtered using SIDS for robust online task adaptation. Experimental results on a tailored benchmark suite built on MuJoCo environments demonstrate that COSTA consistently balances safety and reward maximization, outperforming multiple baselines.
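
The filtering step mentioned in the abstract (scoring collected trajectories and keeping only those that pass) can be illustrated with a minimal sketch. The abstract does not define SIDS, so the scoring function below is a hypothetical placeholder that only combines a return term and a cost-limit term; it is not the paper's metric and does not measure in-distribution degree. The names `Trajectory`, `placeholder_score`, `filter_trajectories`, `cost_limit`, `ref_return`, and `threshold` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    rewards: List[float]  # per-step rewards
    costs: List[float]    # per-step safety costs


def placeholder_score(traj: Trajectory, cost_limit: float, ref_return: float) -> float:
    """Hypothetical stand-in for a trajectory score (NOT the paper's SIDS).

    Multiplies a return term in [0, 1] by a cost term in [0, 1], so a
    trajectory scores high only if it is both rewarding and within the limit.
    """
    total_return = sum(traj.rewards)
    total_cost = sum(traj.costs)
    # Return term: fraction of a reference return, clipped to [0, 1].
    if ref_return > 0:
        reward_term = max(0.0, min(total_return / ref_return, 1.0))
    else:
        reward_term = 0.0
    # Cost term: 1 if the constraint holds, otherwise shrinks with the violation.
    cost_term = 1.0 if total_cost <= cost_limit else cost_limit / total_cost
    return reward_term * cost_term


def filter_trajectories(
    trajs: List[Trajectory], cost_limit: float, ref_return: float, threshold: float
) -> List[Trajectory]:
    """Keep only trajectories whose placeholder score reaches the threshold."""
    return [
        t for t in trajs
        if placeholder_score(t, cost_limit, ref_return) >= threshold
    ]


# Example: only the first trajectory is both high-return and within the cost limit.
safe = Trajectory(rewards=[1.0, 1.0], costs=[0.0, 0.1])
unsafe = Trajectory(rewards=[1.0, 1.0], costs=[1.0, 1.0])
low_return = Trajectory(rewards=[0.1, 0.1], costs=[0.0, 0.0])
kept = filter_trajectories(
    [safe, unsafe, low_return], cost_limit=0.5, ref_return=2.0, threshold=0.8
)
```

In this sketch the retained trajectories would then serve as context for task adaptation; how COSTA actually computes the score and uses the filtered data is specified in the paper, not here.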