Abstract: Recent advances in constrained reinforcement learning (RL) have endowed RL with certain safety guarantees. However, deploying existing constrained RL algorithms in continuous control tasks with general hard constraints remains challenging, particularly when the hard constraints are non-convex. Inspired by the generalized reduced gradient (GRG) algorithm, a classical constrained optimization technique, we propose a reduced policy optimization (RPO) algorithm that combines RL with GRG to address general hard constraints. Following the GRG method, RPO partitions actions into basic and nonbasic actions and outputs the basic actions via a policy network. RPO then computes the nonbasic actions by solving the equations defined by the equality constraints, given the obtained basic actions. The policy network is updated by implicitly differentiating the nonbasic actions with respect to the basic actions. Additionally, we introduce an action projection procedure based on the reduced gradient and apply a modified Lagrangian relaxation technique to ensure that inequality constraints are satisfied. To the best of our knowledge, RPO is the first attempt to introduce GRG into RL as a way of efficiently handling both equality and inequality hard constraints. Because RL environments with complex hard constraints are currently lacking, we develop three new benchmarks: two robotics manipulation tasks and a smart grid operation control task. On these benchmarks, RPO achieves better performance than previous constrained RL algorithms in terms of both cumulative reward and constraint violation. We believe RPO, along with the new benchmarks, will open up new opportunities for applying RL to real-world problems with complex constraints.
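The basic/nonbasic split and the implicit differentiation step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the toy equality constraint `h`, the quadratic reward, and the function names `solve_nonbasic` and `reduced_gradient` are assumptions made for illustration, a fixed vector stands in for the policy network output, and the inequality handling (action projection and Lagrangian relaxation) is omitted.

```python
import numpy as np

# Hypothetical toy constraint: h(a_b, a_n) = a_n^3 + a_n - (a_b[0] + 2*a_b[1]) = 0
# a_b: basic actions (shape (2,)), a_n: nonbasic action (shape (1,)).
def h(a_b, a_n):
    return a_n**3 + a_n - (a_b[0] + 2.0 * a_b[1])

def dh_da_n(a_b, a_n):
    # Jacobian of h w.r.t. the nonbasic action, shape (1, 1); always invertible here.
    return np.diag(3.0 * a_n**2 + 1.0)

def dh_da_b(a_b, a_n):
    # Jacobian of h w.r.t. the basic actions, shape (1, 2).
    return np.array([[-1.0, -2.0]])

def solve_nonbasic(a_b, iters=50, tol=1e-12):
    """Solve h(a_b, a_n) = 0 for a_n with Newton's method, holding a_b fixed."""
    a_n = np.zeros(1)
    for _ in range(iters):
        residual = h(a_b, a_n)
        if np.max(np.abs(residual)) < tol:
            break
        a_n = a_n - np.linalg.solve(dh_da_n(a_b, a_n), residual)
    return a_n

def reduced_gradient(a_b, a_n, dr_da_b, dr_da_n):
    """Total derivative of the objective w.r.t. a_b along the constraint surface."""
    # Implicit function theorem: da_n/da_b = -(dh/da_n)^{-1} dh/da_b, shape (1, 2).
    da_n_da_b = -np.linalg.solve(dh_da_n(a_b, a_n), dh_da_b(a_b, a_n))
    return dr_da_b + da_n_da_b.T @ dr_da_n

# Stand-in for the policy network output (assumed values).
a_b = np.array([1.0, 0.5])
a_n = solve_nonbasic(a_b)

# Toy reward r = -(||a_b||^2 + ||a_n||^2) and its partial derivatives.
grad = reduced_gradient(a_b, a_n, dr_da_b=-2.0 * a_b, dr_da_n=-2.0 * a_n)
print(a_n, grad)
```

In this sketch the equality constraint holds by construction for any basic action, so only the basic actions need to be learned; the reduced gradient is the quantity that would be backpropagated into the parameters producing `a_b`.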

Nearly Optimal Policy Optimization with Stable at Any Time Guarantee

Behavior Proximal Policy Optimization

Beyond Reward: Offline Preference-guided Policy Optimization

An Off-Policy Trust Region Policy Optimization Method with Monotonic Improvement Guarantee for Deep Reinforcement Learning

Successive Convex Approximation Based Off-Policy Optimization for Constrained Reinforcement Learning

A Policy Optimization Method Towards Optimal-time Stability

Provably Efficient Exploration in Policy Optimization

Absolute Policy Optimization

A Theoretical Analysis of Optimistic Proximal Policy Optimization in Linear Markov Decision Processes

Reflective Policy Optimization

Truly Proximal Policy Optimization

Exploration-Driven Policy Optimization in RLHF: Theoretical Insights on Efficient Data Utilization

Absolute State-wise Constrained Policy Optimization: High-Probability State-wise Constraints Satisfaction

Policy Optimization over General State and Action Spaces

Trust Region-Guided Proximal Policy Optimization

State-wise Constrained Policy Optimization

Proximal Policy Optimization Algorithms

Reduced Policy Optimization for Continuous Control with Hard Constraints

Decentralized Policy Optimization

MOPO: Model-based Offline Policy Optimization