Abstract: Action-constrained reinforcement learning (ACRL) is a popular approach for solving safety-critical and resource-allocation decision-making problems. A major challenge in ACRL is ensuring that the agent takes a valid action satisfying the constraints at each RL step. The commonly used approach of placing a projection layer on top of the policy network requires solving an optimization program, which can result in longer training time, slow convergence, and the zero-gradient problem. To address this, first we use a normalizing flow model to learn an invertible, differentiable mapping between the feasible action space and the support of a simple distribution on a latent variable, such as a Gaussian. Second, learning the flow model requires sampling from the feasible action space, which is also challenging. We develop multiple methods, based on Hamiltonian Monte Carlo and probabilistic sentential decision diagrams, for such action sampling under convex and non-convex constraints. Third, we integrate the learned normalizing flow with the DDPG algorithm. By design, a well-trained normalizing flow will transform the policy output into a valid action without requiring an optimization solver. Empirically, our approach results in significantly fewer constraint violations (up to an order of magnitude for several instances) and is multiple times faster on a variety of continuous control tasks.

What problem does this paper attempt to address?

This paper addresses action constraints in reinforcement learning (RL), specifically action-constrained RL (ACRL). In ACRL, the agent must take an action that satisfies given constraints at every time step, which is crucial in decision-making problems involving safety and resource allocation. The commonly used approach adds a projection layer on top of the policy network to guarantee that actions are valid, but the projection requires solving an optimization program, which leads to long training times, slow convergence, and the zero-gradient problem. To address these issues, the paper proposes a method called FlowPG, which uses a normalizing flow model to learn an invertible, differentiable mapping between the feasible action space and the support of a simple latent distribution (e.g., a Gaussian). The flow model is trained on samples of valid actions and learns to generate valid actions from them. Because obtaining those samples is itself difficult, the paper develops sampling methods based on Hamiltonian Monte Carlo and probabilistic sentential decision diagrams that cover both convex and non-convex constraints. FlowPG then integrates the learned normalizing flow with the Deep Deterministic Policy Gradient (DDPG) algorithm. With a well-trained flow, the output of the policy network is transformed directly into a valid action without an optimization solver, which avoids both the zero-gradient problem and computationally expensive quadratic programming. Experiments show that FlowPG significantly reduces the number of constraint violations (by up to an order of magnitude on several instances) and speeds up training on a variety of continuous control tasks. In summary, the paper addresses how to handle constraints on continuous action spaces efficiently, for safer and faster training.
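The contrast between a projection layer and an invertible mapping can be illustrated with a small sketch. It is not the paper's method: the feasible set is assumed to be the unit ball, and the hypothetical `flow_like_map` is a hand-written bijection standing in for a trained normalizing flow. The sketch only shows that an invertible map sends every latent point to a feasible action without a solver, and that its output still responds when the input moves outward, whereas the projection output does not once the input is outside the ball.

```python
import numpy as np

def flow_like_map(z):
    """Hypothetical stand-in for a trained flow: sends any z in R^n into the open unit ball."""
    r = np.linalg.norm(z)
    if r == 0.0:
        return z.copy()
    return z * (np.tanh(r) / r)

def flow_like_inverse(a):
    """Inverse of flow_like_map, defined for actions strictly inside the unit ball."""
    r = np.linalg.norm(a)
    if r == 0.0:
        return a.copy()
    return a * (np.arctanh(r) / r)

def project_to_ball(z):
    """Euclidean projection onto the closed unit ball (closed form for this simple set)."""
    r = np.linalg.norm(z)
    return z.copy() if r <= 1.0 else z / r

def radial_sensitivity(f, z, eps=1e-4):
    """Central finite difference of f along the outward direction of z."""
    u = z / np.linalg.norm(z)
    return np.linalg.norm(f(z + eps * u) - f(z - eps * u)) / (2.0 * eps)

# Sample Gaussian latents and map them to actions; no optimization solver is involved.
rng = np.random.default_rng(0)
latents = rng.normal(size=(1000, 2))
actions = np.array([flow_like_map(z) for z in latents])
all_feasible = bool(np.all(np.linalg.norm(actions, axis=1) < 1.0))

# The map is invertible: the latent point is recovered from the action.
z0 = np.array([2.0, -1.0])  # a policy output that lies outside the unit ball
round_trip_ok = bool(np.allclose(flow_like_inverse(flow_like_map(z0)), z0))

# Outside the ball, the projection ignores outward changes of its input,
# while the invertible map still responds to them.
proj_sens = radial_sensitivity(project_to_ball, z0)
flow_sens = radial_sensitivity(flow_like_map, z0)

print(all_feasible, round_trip_ok, proj_sens, flow_sens)
```

For general constraint sets the projection has no closed form and requires a quadratic program at every step, and a suitable bijection cannot be written by hand, which is why the paper learns the mapping from sampled feasible actions instead.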

FlowPG: Action-constrained Policy Gradient with Normalizing Flows

Towards Interpretable Reinforcement Learning with Constrained Normalizing Flow Policies

GFlowNet Training by Policy Gradients

Leveraging exploration in off-policy algorithms via normalizing flows

Advanced deep-reinforcement-learning methods for flow control: group-invariant and positional-encoding networks improve learning speed and quality

Generative Flow Networks as Entropy-Regularized RL

FlowPolicy: Enabling Fast and Robust 3D Flow-based Policy via Consistency Flow Matching for Robot Manipulation

Convex Regularization and Convergence of Policy Gradient Flows under Safety Constraints

AdaFlow: Imitation Learning with Variance-Adaptive Flow-Based Policies

QGFN: Controllable Greediness with Action Values

Design of Restricted Normalizing Flow towards Arbitrary Stochastic Policy with Computational Efficiency

FlowNav: Learning Efficient Navigation Policies via Conditional Flow Matching

Maximum Entropy Reinforcement Learning via Energy-Based Normalizing Flow

Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning

Action abstractions for amortized sampling

Accelerating Deep Reinforcement Learning strategies of Flow Control through a multi-environment approach

Rectifying Reinforcement Learning for Reward Matching

Model-based deep reinforcement learning for accelerated learning from flow simulations