POLICEd RL: Learning Closed-Loop Robot Control Policies with Provable Satisfaction of Hard Constraints

Jean-Baptiste Bouvier, Kartik Nagpal, Negar Mehr
2024-06-04
Abstract: In this paper, we seek to learn a robot policy guaranteed to satisfy state constraints. To encourage constraint satisfaction, existing RL algorithms typically rely on Constrained Markov Decision Processes and discourage constraint violations through reward shaping. However, such soft constraints cannot offer verifiable safety guarantees. To address this gap, we propose POLICEd RL, a novel RL algorithm explicitly designed to enforce affine hard constraints in closed-loop with a black-box environment. Our key insight is to force the learned policy to be affine around the unsafe set and use this affine region as a repulsive buffer to prevent trajectories from violating the constraint. We prove that such policies exist and guarantee constraint satisfaction. Our proposed framework is applicable to both systems with continuous and discrete state and action spaces and is agnostic to the choice of the RL training algorithm. Our results demonstrate the capacity of POLICEd RL to enforce hard constraints in robotic tasks while significantly outperforming existing methods.
Robotics
What problem does this paper attempt to address?
This paper addresses the problem of guaranteeing hard constraint satisfaction in learned robot control policies. Existing Reinforcement Learning (RL) algorithms typically encourage compliance with constraints through reward shaping, but this approach cannot provide verifiable safety guarantees. To close this gap, the authors propose a new RL algorithm, POLICEd RL, which explicitly enforces affine hard constraints in closed loop.

### Main Issues

1. **Limitations of Existing Methods**:
   - Existing RL algorithms typically rely on Constrained Markov Decision Processes (CMDPs) and penalize constraint violations through reward shaping.
   - These soft constraints cannot provide verifiable safety guarantees.
2. **Objectives**:
   - Propose a new RL algorithm that ensures hard constraint satisfaction in closed-loop systems.
   - Learn a control policy in a deterministic black-box environment such that the policy never drives the state into violation of the given affine constraints.

### Solution

- **POLICEd RL**:
  - Forces the learned policy to be affine near the unsafe region and uses this affine region as a repulsive buffer that prevents trajectories from entering the unsafe set.
  - Proves that such a policy exists and that it guarantees constraint satisfaction.
  - Applies to both continuous and discrete state and action spaces and is independent of the chosen RL training algorithm.

### Key Innovations

1. **Repulsive Buffer Zone**:
   - Creates a repulsive buffer zone around the unsafe region and learns a policy that keeps the robot's state out of that region.
   - Because the policy output is restricted to be affine within the buffer zone, it is easy to verify whether trajectories can violate the constraint.
2. **Theoretical Guarantees**:
   - Establishes analytical conditions ensuring that the learned policy satisfies the hard constraints after training.
   - Transforms the problem of finding a constraint-satisfying policy into a solvable linear problem.
3. **Applicability**:
   - The method applies to both continuous and discrete state and action spaces.
   - It handles black-box environments through a local measure of the environment's nonlinearity, which can be estimated numerically.

### Experimental Validation

- The effectiveness of POLICEd RL is validated through numerical experiments and through tasks involving an inverted pendulum and a robotic arm in the high-fidelity MuJoCo simulator.
- Results show that POLICEd RL not only outperforms baseline methods in constraint satisfaction but also excels in expected cumulative reward.

### Summary

This paper proposes a new RL algorithm, POLICEd RL, which ensures hard constraint satisfaction in closed-loop systems and thereby addresses the inability of existing methods to provide verifiable safety guarantees. By creating a repulsive buffer zone in which the policy output is restricted to be affine, the algorithm comes with theoretical guarantees and outperforms baseline methods experimentally.
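
### Illustrative Sketch

The claim that an affine policy makes constraint satisfaction "easy to verify" can be made concrete with a toy example. The sketch below is not the authors' implementation. It assumes known affine dynamics `s_dot = A s + B a` (the paper targets black-box environments), an axis-aligned box as the buffer, and a policy `a = K s + k` that is already affine on that buffer. Under these assumptions the constraint's rate of change `C @ s_dot` is affine in the state, so its maximum over the buffer is attained at a vertex, and checking the vertices is enough. All names (`buffer_vertices`, `is_repulsive`, `margin`) are hypothetical; `margin` is only a stand-in for whatever tolerance is needed to absorb nonlinearity.

```python
import itertools
import numpy as np


def buffer_vertices(lower, upper):
    """Return all corner points of the axis-aligned box [lower, upper]."""
    return np.array(list(itertools.product(*zip(lower, upper))))


def is_repulsive(A, B, K, k, C, vertices, margin=0.0):
    """Check that C @ s_dot <= -margin at every vertex of the buffer.

    Assumes s_dot = A s + B (K s + k). Then C @ s_dot is affine in s, so its
    maximum over a polytope is attained at one of the polytope's vertices.
    """
    s_dot = vertices @ (A + B @ K).T + B @ k  # shape (n_vertices, n_state)
    return bool(np.all(s_dot @ C <= -margin))


# Toy double integrator with state s = (position, velocity), action = acceleration.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])

# Affine constraint C @ s <= 1.0, i.e. velocity must stay below 1.0.
C = np.array([0.0, 1.0])

# Buffer: position in [0, 1], velocity in [0.8, 1.0] (just below the limit).
verts = buffer_vertices(lower=[0.0, 0.8], upper=[1.0, 1.0])

# Policy that brakes everywhere in the buffer: a = -5 * velocity + 3.
K_good, k_good = np.array([[0.0, -5.0]]), np.array([3.0])
# Policy that still accelerates at the low-velocity edge: a = -5 * velocity + 4.5.
K_bad, k_bad = np.array([[0.0, -5.0]]), np.array([4.5])

good_ok = is_repulsive(A, B, K_good, k_good, C, verts, margin=0.5)
bad_ok = is_repulsive(A, B, K_bad, k_bad, C, verts, margin=0.5)
print(good_ok, bad_ok)
```

The first policy decelerates at all four corners of the buffer, so velocity decreases throughout the buffer and a trajectory cannot cross it to reach the limit. The second policy fails the check at the corners where velocity is 0.8. How the policy is made affine on the buffer during training is outside the scope of this sketch.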