Abstract: Reinforcement learning (RL) is attracting increasing interest in autonomous driving due to its potential to solve complex classification and control problems. However, existing RL algorithms are rarely applied to real vehicles because of two predominant problems: their behaviours are unexplainable, and they cannot guarantee safety under new scenarios. This paper presents a safe RL algorithm, called Parallel Constrained Policy Optimization (PCPO), for two autonomous driving tasks. PCPO extends today's common actor-critic architecture to a three-component learning framework, in which three neural networks are used to approximate the policy function, value function and a newly added risk function, respectively. Meanwhile, a trust region constraint is added to allow large update steps without breaking the monotonic improvement condition. To ensure the feasibility of safety-constrained problems, synchronized parallel learners are employed to explore different state spaces, which accelerates learning and policy updates. The simulations of two scenarios for autonomous vehicles confirm that PCPO can ensure safety while achieving fast learning.
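
The abstract describes a policy update that is constrained both by a risk limit and by a trust region. One common way to write such a problem, borrowed from the general constrained policy optimization literature, is sketched below. The symbols are assumptions introduced here for illustration and are not taken from the paper: $J_R$ is the expected return, $J_C$ the expected risk, $d$ the predefined risk limit, $\delta$ the trust-region radius, and $\pi_{\theta_k}$ the current policy.

```latex
\begin{aligned}
\theta_{k+1} = \arg\max_{\theta} \quad & J_R(\pi_\theta) \\
\text{s.t.} \quad & J_C(\pi_\theta) \le d, \\
& \bar{D}_{\mathrm{KL}}\!\left(\pi_{\theta_k} \,\|\, \pi_\theta\right) \le \delta .
\end{aligned}
```

In this reading, the risk network would estimate the quantities behind $J_C$, the value network those behind $J_R$, and the KL bound would keep each new policy close to the previous one.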

What problem does this paper attempt to address?

The problem that this paper attempts to solve is how to achieve fast and safe Reinforcement Learning (RL) in autonomous driving tasks. Specifically, existing RL algorithms have two main problems when applied to real-vehicle scenarios: the behavior is not interpretable, and safety cannot be guaranteed in new scenarios. These problems limit the application of RL in actual autonomous driving systems. To solve the above problems, the paper proposes a new safe reinforcement learning algorithm, Parallel Constrained Policy Optimization (PCPO).

The main contributions of the PCPO algorithm are as follows:

1. **Introducing the risk function**: In addition to the traditional reward function, a risk function is introduced to evaluate the safety of the policy. By constraining the expected risk not to exceed the predefined risk limit, the safety of the policy during the learning process is ensured.
2. **Trust-region constraint**: In order to allow a larger policy update step size without violating the monotonic improvement condition, PCPO adds a trust-region constraint. This enables the algorithm to accelerate the convergence speed while ensuring safety.
3. **Synchronous parallel learning framework**: By having multiple parallel learners simultaneously explore different state spaces, the correlation of the sample set is reduced, and the possibility of finding a feasible state is increased, thereby accelerating the learning process.

The paper verifies the effectiveness of the PCPO algorithm through experiments on two autonomous driving tasks:

- **Lane-keeping task**: The experimental results show that the PCPO algorithm can ensure that all parallel vehicles always stay within the lane during the learning process, and the deviation is rapidly reduced.
- **Intersection multi-vehicle decision-making task**: The experimental results indicate that the PCPO algorithm can not only ensure safety but also converge to the optimal policy at a relatively fast speed.

Overall, the PCPO algorithm improves the learning speed and data efficiency while ensuring safety, reduces the possibility of the learning agent getting trapped in sub-optimal solutions, or at least ensures that it reaches a safe sub-optimal policy.
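
The three ideas summarized above (a risk limit, a bounded update step, and synchronized parallel learners) can be illustrated with a deliberately small sketch. It is not the paper's implementation: PCPO uses neural networks and a trust region, whereas this toy uses a single scalar policy parameter, a first-order risk prediction, and step clipping as a stand-in for the trust region. All names (`LearnerEstimate`, `synchronized_update`, `risk_limit`, `max_step`, `lr`) are hypothetical.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass
class LearnerEstimate:
    """What one parallel learner reports after exploring its own region (hypothetical)."""
    reward_grad: float    # local estimate of d(expected return)/d(theta)
    risk_grad: float      # local estimate of d(expected risk)/d(theta)
    expected_risk: float  # local estimate of the expected risk


def synchronized_update(theta, estimates, risk_limit, max_step, lr=0.1):
    """One shared policy update from the averaged estimates of all learners."""
    # Synchronization: wait for every learner, then average their estimates.
    g_reward = fmean(e.reward_grad for e in estimates)
    g_risk = fmean(e.risk_grad for e in estimates)
    risk = fmean(e.expected_risk for e in estimates)

    if risk > risk_limit:
        # Currently infeasible: take a recovery step that only reduces risk.
        step = -lr * g_risk
    else:
        step = lr * g_reward
        # First-order prediction of the risk after the step.
        risk_increase = g_risk * step
        if risk_increase > 0 and risk + risk_increase > risk_limit:
            # Shrink the step so the predicted risk lands on the limit.
            step *= (risk_limit - risk) / risk_increase

    # Assumed stand-in for the trust region: bound the size of the step.
    step = max(-max_step, min(max_step, step))
    return theta + step


# Two learners report estimates from different parts of the state space.
reports = [LearnerEstimate(1.2, 0.4, 0.05), LearnerEstimate(0.8, 0.6, 0.15)]
new_theta = synchronized_update(0.0, reports, risk_limit=0.2, max_step=0.05)
```

In this example the averaged reward gradient proposes a step of 0.1, the predicted risk stays under the limit, and the step bound then caps the update at 0.05. Averaging before updating is what makes the learners synchronous: every learner continues from the same shared parameter.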

Safe Reinforcement Learning for Autonomous Vehicles through Parallel Constrained Policy Optimization

Model-Based Safe Reinforcement Learning with Time-Varying State and Control Constraints: An Application to Intelligent Vehicles

Model-Based Safe Reinforcement Learning With Time-Varying Constraints: Applications to Intelligent Vehicles

State-wise Constrained Policy Optimization

Safe Efficient Policy Optimization Algorithm for Unsignalized Intersection Navigation

Long and Short-Term Constraints Driven Safe Reinforcement Learning for Autonomous Driving

Safe-State Enhancement Method for Autonomous Driving Via Direct Hierarchical Reinforcement Learning

Towards Robust Decision-Making for Autonomous Highway Driving Based on Safe Reinforcement Learning

Absolute State-wise Constrained Policy Optimization: High-Probability State-wise Constraints Satisfaction

Learn With Imagination: Safe Set Guided State-wise Constrained Policy Optimization

Safe Exploration in Wireless Security: A Safe Reinforcement Learning Algorithm With Hierarchical Structure

Safe Autonomous Driving with Latent Dynamics and State-Wise Constraints

Offline Reinforcement Learning for Autonomous Driving with Safety and Exploration Enhancement

Safe Reinforcement Learning with Probabilistic Control Barrier Functions for Ramp Merging

Automated Driving Maneuvers under Interactive Environment based on Deep Reinforcement Learning

Safe Driving Via Expert Guided Policy Optimization

Enhancing System-Level Safety in Mixed-Autonomy Platoon via Safe Reinforcement Learning

Weakly Supervised Reinforcement Learning for Autonomous Highway Driving via Virtual Safety Cages

Knowledge Transfer from Simple to Complex: A Safe and Efficient Reinforcement Learning Framework for Autonomous Driving Decision-Making

Safe Deep Policy Adaptation

Multi-objective Optimization Based Deep Reinforcement Learning for Autonomous Driving Policy