Safe Reinforcement Learning for Autonomous Vehicles through Parallel Constrained Policy Optimization

Lu Wen, Jingliang Duan, Shengbo Eben Li, Shaobing Xu, Huei Peng
DOI: https://doi.org/10.48550/arXiv.2003.01303
2020-03-03
Abstract: Reinforcement learning (RL) is attracting increasing interest in autonomous driving due to its potential to solve complex classification and control problems. However, existing RL algorithms are rarely applied to real vehicles because of two predominant problems: their behaviours are unexplainable, and they cannot guarantee safety in new scenarios. This paper presents a safe RL algorithm, called Parallel Constrained Policy Optimization (PCPO), for two autonomous driving tasks. PCPO extends today's common actor-critic architecture to a three-component learning framework, in which three neural networks are used to approximate the policy function, the value function, and a newly added risk function, respectively. Meanwhile, a trust region constraint is added to allow large update steps without breaking the monotonic improvement condition. To ensure the feasibility of safety-constrained problems, synchronized parallel learners are employed to explore different state spaces, which accelerates learning and policy updates. Simulations of two scenarios for autonomous vehicles confirm that we can ensure safety while achieving fast learning.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses how to achieve fast and safe reinforcement learning (RL) in autonomous driving tasks. Existing RL algorithms have two main problems when applied to real-vehicle scenarios: their behavior is not interpretable, and their safety cannot be guaranteed in new scenarios. These problems limit the use of RL in practical autonomous driving systems.

To address them, the paper proposes a new safe reinforcement learning algorithm, Parallel Constrained Policy Optimization (PCPO). Its main contributions are as follows:

1. **Introducing the risk function**: In addition to the traditional reward function, a risk function is introduced to evaluate the safety of the policy. Constraining the expected risk so that it does not exceed a predefined risk limit keeps the policy safe during learning.
2. **Trust-region constraint**: To allow larger policy update steps without violating the monotonic improvement condition, PCPO adds a trust-region constraint. This lets the algorithm converge faster while remaining safe.
3. **Synchronous parallel learning framework**: Multiple parallel learners explore different state spaces simultaneously, which reduces the correlation within the sample set and increases the chance of finding a feasible state, thereby accelerating learning.

The paper verifies the effectiveness of PCPO through experiments on two autonomous driving tasks:

- **Lane-keeping task**: The experimental results show that PCPO keeps all parallel vehicles within the lane throughout learning, and that the deviation decreases rapidly.
- **Intersection multi-vehicle decision-making task**: The experimental results indicate that PCPO not only ensures safety but also converges to the optimal policy relatively quickly.
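The risk constraint and the trust-region constraint described in the first two contributions can be summarized as one constrained update problem. The block below is a sketch in generic constrained policy optimization notation and is not taken from the paper. All symbols are assumptions introduced here for illustration: $J_R$ is the expected return, $J_C$ the expected risk, $d$ the predefined risk limit, $\pi_k$ the current policy, and $\delta$ the trust-region size.

```latex
\pi_{k+1} = \arg\max_{\pi} \; J_R(\pi)
\quad \text{s.t.} \quad
J_C(\pi) \le d,
\qquad
\bar{D}_{\mathrm{KL}}\!\left(\pi \,\|\, \pi_k\right) \le \delta
```

Read this way, the first constraint keeps every updated policy within the risk limit, while the second bounds how far a single update may move from the current policy.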
Overall, PCPO improves learning speed and data efficiency while ensuring safety. It reduces the chance that the learning agent gets trapped in a sub-optimal solution, or at least ensures that the agent reaches a safe sub-optimal policy.
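To make the ideas of a risk signal, a risk limit, and synchronized parallel learners more concrete, here is a minimal Python sketch. It is not the paper's implementation. It assumes simple Monte Carlo averages in place of the neural-network value and risk functions, and it omits the policy update itself. The names `discounted_sum`, `pool_learners`, `evaluate_policy`, `is_safe`, `gamma`, and `risk_limit`, as well as the toy data, are hypothetical and chosen only for illustration.

```python
def discounted_sum(signal, gamma):
    """Return the sum over t of gamma**t * signal[t] for one trajectory."""
    total = 0.0
    for x in reversed(signal):
        total = x + gamma * total
    return total


def pool_learners(per_learner_batches):
    """Synchronous step (assumed): merge trajectories from every learner."""
    pooled = []
    for batch in per_learner_batches:
        pooled.extend(batch)
    return pooled


def evaluate_policy(trajectories, gamma):
    """Monte Carlo estimates of (expected return, expected risk).

    Each trajectory is a (rewards, risks) pair of per-step lists.
    """
    n = len(trajectories)
    expected_return = sum(discounted_sum(r, gamma) for r, _ in trajectories) / n
    expected_risk = sum(discounted_sum(c, gamma) for _, c in trajectories) / n
    return expected_return, expected_risk


def is_safe(expected_risk, risk_limit):
    """Constraint check: the expected risk must not exceed the risk limit."""
    return expected_risk <= risk_limit


# Toy data (hypothetical): two learners, one (rewards, risks) trajectory each.
learner_a = [([1.0, 1.0], [0.0, 0.0])]
learner_b = [([1.0, 0.0], [0.0, 1.0])]

pooled = pool_learners([learner_a, learner_b])
expected_return, expected_risk = evaluate_policy(pooled, gamma=0.5)
safe = is_safe(expected_risk, risk_limit=0.3)
print(expected_return, expected_risk, safe)  # 1.25 0.25 True
```

Pooling before evaluation mirrors the summary's point that learners exploring different states supply a more varied sample set for each update. A full algorithm would use these estimates inside a constrained, trust-region policy update rather than only checking feasibility.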