Abstract: Deep Actor-Critic algorithms, which combine Actor-Critic with deep neural networks (DNNs), are among the most prevalent reinforcement learning algorithms for decision-making problems in simulated environments. However, existing deep Actor-Critic algorithms are not yet mature enough to solve realistic problems with non-convex stochastic constraints and a high cost of interacting with the environment. In this paper, we propose a single-loop deep Actor-Critic (SLDAC) algorithmic framework for general constrained reinforcement learning (CRL) problems. In the actor step, the constrained stochastic successive convex approximation (CSSCA) method is applied to handle the non-convex stochastic objective and constraints. In the critic step, the critic DNNs are updated only once or a finite number of times in each iteration, which simplifies the algorithm to a single-loop framework; existing works instead require enough critic updates in each iteration for the inner loop to converge sufficiently. Moreover, the variance of the policy gradient estimate is reduced by reusing observations from the old policy. The single-loop design and the observation reuse effectively reduce the agent-environment interaction cost and the computational complexity. Despite the biased policy gradient estimate incurred by the single-loop design and observation reuse, we prove that SLDAC with a feasible initial point converges to a Karush-Kuhn-Tucker (KKT) point of the original problem almost surely. Simulations show that the SLDAC algorithm achieves superior performance with much lower interaction cost.
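
The single-loop idea described in the abstract can be illustrated with a minimal sketch on a toy problem. This is not the SLDAC algorithm itself: there are no DNNs or critics here, and the scalar problem, the noise model, the function name `sca_toy`, the curvature `tau`, and the step-size exponents are all assumptions chosen for illustration. The sketch only shows the general pattern of a successive-convex-approximation step with one noisy observation per iteration, where old observations are reused through recursive averaging and the iterate moves toward the minimizer of a convex surrogate problem.

```python
import numpy as np


def sca_toy(num_iters=2000, seed=0):
    """Toy problem: minimise (theta - 3)^2 subject to theta - 2 <= 0."""
    rng = np.random.default_rng(seed)
    theta = 0.0   # feasible initial point, since 0 - 2 <= 0
    tau = 1.0     # curvature of the quadratic surrogates (hypothetical choice)
    g0_bar = f1_bar = g1_bar = 0.0  # running averages of the noisy estimates

    for t in range(1, num_iters + 1):
        rho = t ** -0.6    # averaging weight given to the newest observation
        gamma = t ** -0.9  # step size; decays faster than rho

        # One noisy observation per iteration (a stand-in for the estimates
        # that a critic would supply in an actor-critic method).
        g0_hat = 2.0 * (theta - 3.0) + rng.normal(scale=0.5)  # objective gradient
        f1_hat = (theta - 2.0) + rng.normal(scale=0.5)        # constraint value
        g1_hat = 1.0                                          # constraint gradient

        # Reuse old observations through recursive averaging.
        g0_bar = (1.0 - rho) * g0_bar + rho * g0_hat
        f1_bar = (1.0 - rho) * f1_bar + rho * f1_hat
        g1_bar = (1.0 - rho) * g1_bar + rho * g1_hat

        # Convex surrogates in the step d = x - theta:
        #   objective:  g0_bar * d + tau * d^2
        #   constraint: f1_bar + g1_bar * d + tau * d^2 <= 0
        d_unc = -g0_bar / (2.0 * tau)  # unconstrained surrogate minimiser
        disc = g1_bar ** 2 - 4.0 * tau * f1_bar
        if disc >= 0.0:
            # Surrogate constraint set is an interval; project onto it.
            lo = (-g1_bar - np.sqrt(disc)) / (2.0 * tau)
            hi = (-g1_bar + np.sqrt(disc)) / (2.0 * tau)
            d = np.clip(d_unc, lo, hi)
        else:
            # Surrogate problem infeasible: minimise the constraint surrogate.
            d = -g1_bar / (2.0 * tau)

        # Move a fraction gamma of the way toward the surrogate solution.
        theta = theta + gamma * d
    return float(theta)


if __name__ == "__main__":
    print(sca_toy())
```

The constrained optimum of this toy problem is theta = 2, so the returned value should settle close to 2 even though each iteration uses a single noisy observation and no inner loop.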

Simultaneous Double Q-learning with Conservative Advantage Learning for Actor-Critic Methods

Finite-Time Analysis of Simultaneous Double Q-learning

Efficient Continuous Control with Double Actors and Regularized Critics

Mitigating Off-Policy Bias in Actor-Critic Methods with One-Step Q-learning: A Novel Correction Approach

Exploiting Estimation Bias in Clipped Double Q-Learning for Continous Control Reinforcement Learning Tasks

PAC-Bayesian Soft Actor-Critic Learning

Self-correcting Q-learning

Actor-Critic With Synthesis Loss for Solving Approximation Biases

Soft Decomposed Policy-Critic: Bridging the Gap for Effective Continuous Control with Discrete RL

Adapting Double Q-Learning for Continuous Reinforcement Learning

A Single-Loop Deep Actor-Critic Algorithm for Constrained Reinforcement Learning with Provable Convergence

Strategically Conservative Q-Learning

Sample-Efficient Reinforcement Learning Via Conservative Model-Based Actor-Critic

Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for Addressing Value Estimation Errors

Double Successive Over-Relaxation Q-Learning with an Extension to Deep Reinforcement Learning

Randomized Ensembled Double Q-Learning: Learning Fast Without a Model

Revisiting Discrete Soft Actor-Critic

Seizing Serendipity: Exploiting the Value of Past Success in Off-Policy Actor-Critic

Successively Pruned Q-Learning: Using Self Q-function to Reduce the Overestimation

Explorer-Actor-Critic: Better Actors for Deep Reinforcement Learning

Actor-Critic Reinforcement Learning with Phased Actor