Abstract: In this paper, we seek to learn a robot policy guaranteed to satisfy state constraints. To encourage constraint satisfaction, existing RL algorithms typically rely on Constrained Markov Decision Processes and discourage constraint violations through reward shaping. However, such soft constraints cannot offer verifiable safety guarantees. To address this gap, we propose POLICEd RL, a novel RL algorithm explicitly designed to enforce affine hard constraints in closed-loop with a black-box environment. Our key insight is to force the learned policy to be affine around the unsafe set and use this affine region as a repulsive buffer to prevent trajectories from violating the constraint. We prove that such policies exist and guarantee constraint satisfaction. Our proposed framework is applicable to both systems with continuous and discrete state and action spaces and is agnostic to the choice of the RL training algorithm. Our results demonstrate the capacity of POLICEd RL to enforce hard constraints in robotic tasks while significantly outperforming existing methods.

What problem does this paper attempt to address?

This paper attempts to address the problem of ensuring hard constraint satisfaction in robotic control strategies. Specifically, existing Reinforcement Learning (RL) algorithms typically encourage compliance with constraints through reward shaping, but this approach cannot provide verifiable safety guarantees. To solve this problem, the authors propose a new RL algorithm, POLICEd RL, which can explicitly enforce affine hard constraints in closed-loop systems.

### Main Issues

1. **Limitations of Existing Methods**:
   - Existing RL algorithms typically rely on Constrained Markov Decision Processes (CMDPs) and penalize constraint violations through reward shaping.
   - This soft constraint approach cannot provide verifiable safety guarantees.
2. **Objectives**:
   - Propose a new RL algorithm that ensures hard constraint satisfaction in closed-loop systems.
   - The algorithm needs to learn a control strategy in a deterministic black-box environment, ensuring that the strategy does not lead to states violating given affine constraints.

### Solution

- **POLICEd RL**:
  - Creates a repulsive buffer zone by enforcing the learned policy to be affine near unsafe regions, preventing trajectories from entering unsafe areas.
  - Proves the existence of such a policy and guarantees constraint satisfaction.
  - The framework is applicable to both continuous and discrete state-action spaces and is independent of the chosen RL training algorithm.

### Key Innovations

1. **Repulsive Buffer Zone**:
   - Creates a repulsive buffer zone around unsafe regions, learning a policy that keeps the robot's state away from unsafe areas.
   - By restricting the policy output to be affine within the buffer zone, it is easy to verify whether trajectories violate constraints.
2. **Theoretical Guarantees**:
   - Establishes analytical conditions ensuring that the learned policy satisfies hard constraints after training.
   - Transforms the problem of finding a constraint-satisfying policy into a solvable linear problem.
3. **Applicability**:
   - The method is applicable to both continuous and discrete state-action spaces and can handle black-box environments.
   - Adapts to black-box environments through a local nonlinear metric, which can be obtained via numerical estimation.

### Experimental Validation

- Validated the effectiveness of POLICEd RL through numerical experiments and tasks involving an inverted pendulum and a robotic arm in the high-fidelity MuJoCo simulator.
- Results show that POLICEd RL not only outperforms baseline methods in constraint satisfaction but also excels in expected cumulative rewards.

### Summary

This paper proposes a new RL algorithm, POLICEd RL, which ensures hard constraint satisfaction in closed-loop systems, addressing the issue of existing methods' inability to provide verifiable safety guarantees. By creating a repulsive buffer zone and restricting policy output to be affine, the algorithm demonstrates superior performance both theoretically and experimentally.
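The point that an affine policy makes the buffer easy to verify can be illustrated with a small sketch. It is not the authors' implementation, and it makes simplifying assumptions the paper does not: the dynamics are taken to be known and affine, `s_dot = A s + B a + c`, whereas the paper targets black-box environments, and the margin `eps` is only a hypothetical stand-in for the nonlinearity measure mentioned above. All function and variable names are illustrative. Under these assumptions, `C @ s_dot` is affine in the state, so its maximum over a box-shaped buffer is attained at a corner, and checking the corners is enough.

```python
import itertools
import numpy as np


def box_vertices(lower, upper):
    """Return all 2^n corners of the axis-aligned box [lower, upper]."""
    return np.array(list(itertools.product(*zip(lower, upper))))


def is_repulsive(vertices, C, A, B, c, K, k, eps=0.0):
    """Check C @ s_dot <= -eps at every vertex of the buffer.

    Assumed (hypothetical) model: s_dot = A s + B a + c, with the
    affine policy a = K s + k inside the buffer.
    """
    actions = vertices @ K.T + k                    # policy output at each vertex
    s_dot = vertices @ A.T + actions @ B.T + c      # state derivative at each vertex
    return bool(np.all(s_dot @ C <= -eps))


# Toy 2D single integrator: s = (x, y), s_dot = a, constraint y <= 1.
A = np.zeros((2, 2))
B = np.eye(2)
c = np.zeros(2)
C = np.array([0.0, 1.0])

# Buffer: the strip 0.8 <= y <= 1.0 for x in [-1, 1], just below the constraint.
buffer_vertices = box_vertices([-1.0, 0.8], [1.0, 1.0])

K = np.array([[0.0, 0.0], [0.0, -1.0]])
k_safe = np.array([1.0, 0.5])    # a_y = -y + 0.5, negative everywhere in the buffer
k_unsafe = np.array([1.0, 0.9])  # a_y = -y + 0.9, positive at y = 0.8

safe = is_repulsive(buffer_vertices, C, A, B, c, K, k_safe, eps=0.1)
unsafe = is_repulsive(buffer_vertices, C, A, B, c, K, k_unsafe, eps=0.1)
print(safe, unsafe)
```

In this toy setting the first policy pushes the state away from the constraint at every corner of the buffer, while the second fails at the lower edge. A finite set of vertex checks replaces a check over the whole region, which is what makes the constraint verification a linear problem.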

POLICEd RL: Learning Closed-Loop Robot Control Policies with Provable Satisfaction of Hard Constraints

Learning to Provably Satisfy High Relative Degree Constraints for Black-Box Systems

Learning Observation-Based Certifiable Safe Policy for Decentralized Multi-Robot Navigation

Successive Convex Approximation Based Off-Policy Optimization for Constrained Reinforcement Learning

ROSCOM: Robust Safe Reinforcement Learning on Stochastic Constraint Manifolds

Safe Multiagent Learning with Soft Constrained Policy Optimization in Real Robot Control

Probabilistic Constraint for Safety-Critical Reinforcement Learning

State-wise Constrained Policy Optimization

Lyapunov Barrier Policy Optimization

Reduced Policy Optimization for Continuous Control with Hard Constraints

Safe Reinforcement Learning for Autonomous Vehicles through Parallel Constrained Policy Optimization

Constrained Variational Policy Optimization for Safe Reinforcement Learning

Learn With Imagination: Safe Set Guided State-wise Constrained Policy Optimization

Safe reinforcement learning for probabilistic reachability and safety specifications: A Lyapunov-based approach

Towards Online Safety Corrections for Robotic Manipulation Policies

Absolute State-wise Constrained Policy Optimization: High-Probability State-wise Constraints Satisfaction

Model-based Safe Deep Reinforcement Learning via a Constrained Proximal Policy Optimization Algorithm

Concurrent Learning of Policy and Unknown Safety Constraints in Reinforcement Learning

Lyapunov-based Safe Policy Optimization for Continuous Control

Efficient Exploration Using Extra Safety Budget in Constrained Policy Optimization

Safety Correction from Baseline: Towards the Risk-aware Policy in Robotics Via Dual-agent Reinforcement Learning