Dynamics Adaptive Safe Reinforcement Learning with a Misspecified Simulator

Ruiqi Xue, Ziqian Zhang, Lihe Li, Feng Chen, Yi-Chen Li, Yang Yu, Lei Yuan
DOI: https://doi.org/10.1007/978-3-031-70368-3_5
2024-01-01
Abstract: Sim-to-real reinforcement learning offers the advantage of learning safe policies within simulators, circumventing the need for costly trial-and-error in the real world. Traditional approaches often rest on the assumption of consistent state-action transitions between the simulator and the real-world environment. However, this assumption can be violated due to the poor fidelity of simulators, leading to a constrained trust region for effective policy learning. The limitation can be more pronounced when safety issues are considered, potentially resulting in unsafe policies if no safe samples exist in the trust region. To overcome these challenges, we propose Dynamics Adaptive Safe Reinforcement Learning with a Misspecified Simulator (DASaR). Our approach begins by relaxing the assumption to expand the trust region and by theoretically demonstrating the unbounded performance gap inherent in traditional methods. Subsequently, DASaR aligns the estimated value functions in the simulator and the real-world environment via inverse dynamics-based relabeling of reward and cost signals. Furthermore, to deal with the underestimation of cost value functions, DASaR employs uncertainty estimation to improve its conservatism, ensuring the safety of the learned policy. Experiments in various complex environments demonstrate DASaR's ability to balance safety satisfaction and reward maximization across diverse dynamics gaps.
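The two mechanisms named in the abstract, inverse dynamics-based relabeling and uncertainty-based conservatism, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names `relabel_transition` and `conservative_cost`, the toy 1-D dynamics, the hand-written inverse dynamics model, and the choice of "mean plus `kappa` standard deviations" over an ensemble of cost estimates as the conservative bound. The paper's actual formulation may differ.

```python
from statistics import mean, pstdev
from typing import Callable, Sequence, Tuple


def relabel_transition(
    s: float,
    s_next: float,
    inverse_dynamics: Callable[[float, float], float],
    reward_fn: Callable[[float, float], float],
    cost_fn: Callable[[float, float], float],
) -> Tuple[float, float, float]:
    """Relabel a simulator transition (s -> s_next) with real-world signals.

    Hypothetical helper: the inverse dynamics model guesses which action
    would have produced the same state change in the real environment, and
    the reward and cost are recomputed for that action.
    """
    a_real = inverse_dynamics(s, s_next)
    return a_real, reward_fn(s, a_real), cost_fn(s, a_real)


def conservative_cost(estimates: Sequence[float], kappa: float = 1.0) -> float:
    """Pessimistic cost estimate: ensemble mean plus kappa standard deviations.

    Assumed form of "uncertainty estimation"; disagreement between ensemble
    members pushes the estimate upward to counter underestimation.
    """
    return mean(estimates) + kappa * pstdev(estimates)


# Toy setup (assumed): the simulator moves s' = s + a, while the real
# actuator is half as strong, s' = s + 0.5 * a.
def real_inverse_dynamics(s: float, s_next: float) -> float:
    return (s_next - s) / 0.5


def reward_fn(s: float, a: float) -> float:
    return -(a ** 2)  # penalize control effort


def cost_fn(s: float, a: float) -> float:
    return 1.0 if abs(a) > 1.5 else 0.0  # unsafe if the action is too large


# In the simulator, action 1.0 takes the state from 0.0 to 1.0.
a_real, r_real, c_real = relabel_transition(
    0.0, 1.0, real_inverse_dynamics, reward_fn, cost_fn
)

# Three hypothetical cost-critic outputs for the same state-action pair.
ensemble = [0.2, 0.4, 0.6]
pessimistic = conservative_cost(ensemble, kappa=1.0)
```

In this toy case the simulator action 1.0 looks safe, but reaching the same next state in the real environment requires action 2.0, which incurs a cost; relabeling exposes that cost to the learner. The pessimistic estimate lies above the ensemble mean whenever the members disagree.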