Abstract: Value-based reinforcement-learning algorithms provide state-of-the-art results in model-free discrete-action settings, and tend to outperform actor-critic algorithms. We argue that actor-critic algorithms are limited by their need for an on-policy critic. We propose Bootstrapped Dual Policy Iteration (BDPI), a novel model-free reinforcement-learning algorithm for continuous states and discrete actions, with an actor and several off-policy critics. Off-policy critics are compatible with experience replay, ensuring high sample-efficiency, without the need for off-policy corrections. The actor, by slowly imitating the average greedy policy of the critics, leads to high-quality and state-specific exploration, which we compare to Thompson sampling. Because the actor and critics are fully decoupled, BDPI is remarkably stable, and unusually robust to its hyper-parameters. BDPI is significantly more sample-efficient than Bootstrapped DQN, PPO, and ACKTR, on discrete, continuous and pixel-based tasks. Source code: https://github.com/vub-ai-lab/bdpi

What problem does this paper attempt to address?

This paper addresses the limitations of current actor-critic algorithms in reinforcement learning, in particular their weak exploration and poor sample efficiency. Specifically:

1. **Limitations of Actor-Critic Algorithms**: Most existing actor-critic algorithms rely on an on-policy critic to evaluate the actor's policy. This dependence prevents them from being combined with efficient off-policy value-based methods (such as the Q-Learning family), which hurts both sample efficiency and exploration.
2. **Exploration Efficiency**: In discrete-action settings, off-policy value-based methods learn efficiently from samples, but how to bring that efficiency into the actor-critic framework to improve exploration remains an open problem.
3. **Sample Efficiency**: Existing actor-critic algorithms often need a large number of samples to converge on tasks with high-dimensional state spaces and sparse rewards, which is often impractical in real applications.

To address these problems, the paper proposes a new model-free reinforcement-learning algorithm, Bootstrapped Dual Policy Iteration (BDPI). BDPI aims to improve sample efficiency and exploration by combining an actor with several off-policy critics. The specific contributions are as follows:

- **Off-Policy Critics**: BDPI uses several off-policy critics, which are compatible with experience replay and therefore improve sample efficiency without off-policy corrections.
- **Actor Learning Rule**: The actor is updated gradually by imitating the greedy policy of the critics. This learning rule is similar to Conservative Policy Iteration, but uses the information provided by off-policy critics.
- **Exploration Ability**: Because the actor's learning rule combines the information of several off-policy critics, it produces high-quality, state-specific exploration, which the paper compares to Thompson sampling.

Experimental results show that BDPI is significantly more sample-efficient than Bootstrapped DQN, PPO, and ACKTR on discrete, continuous, and pixel-based tasks, especially on tasks where exploration is difficult. In addition, BDPI is robust to the choice of hyper-parameters, which matters in practical applications.
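
The actor learning rule summarized above can be illustrated with a small sketch. It assumes a single state with a tabular action distribution and a fixed set of critic Q-values; the function names `greedy_one_hot` and `actor_update`, the step size `rate`, and the toy numbers are hypothetical and are not taken from the authors' code, which trains neural-network actors and critics.

```python
import random


def greedy_one_hot(q_values):
    """One-hot distribution on the highest-valued action (ties go to the lowest index)."""
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    return [1.0 if a == best else 0.0 for a in range(len(q_values))]


def actor_update(policy, critics_q, rate=0.05):
    """Move the actor a small step toward the critics' average greedy policy.

    policy:    action probabilities of the actor for one state
    critics_q: one list of Q-values per critic, for the same state
    rate:      hypothetical imitation step size in (0, 1]
    """
    n_actions = len(policy)
    greedy = [greedy_one_hot(q) for q in critics_q]
    # Average of the critics' greedy (one-hot) policies.
    avg_greedy = [sum(g[a] for g in greedy) / len(greedy) for a in range(n_actions)]
    # Convex combination: stays a valid probability distribution.
    return [(1.0 - rate) * p + rate * g for p, g in zip(policy, avg_greedy)]


# Toy example: 3 actions, 4 critics that disagree about the best action.
policy = [1 / 3, 1 / 3, 1 / 3]
critics_q = [
    [0.1, 0.9, 0.2],
    [0.3, 0.8, 0.1],
    [0.7, 0.2, 0.1],
    [0.0, 0.5, 0.4],
]
for _ in range(50):
    policy = actor_update(policy, critics_q)

# The actor samples from its own distribution, so actions favoured by
# only some critics are still tried occasionally.
action = random.choices(range(3), weights=policy)[0]
```

In this toy setting three of the four critics prefer action 1 and one prefers action 0, so the actor's distribution leans toward action 1 while keeping some probability on action 0. Exploration under this rule follows the critics' disagreement in each state rather than a fixed noise schedule.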

Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics

Frugal Actor-Critic: Sample Efficient Off-Policy Deep Reinforcement Learning Using Unique Experiences

Doubly Robust Off-Policy Actor-Critic Algorithms for Reinforcement Learning

Off-Policy Neural Fitted Actor-Critic

Sample-Efficient Reinforcement Learning Via Conservative Model-Based Actor-Critic

Deep Model-Based Reinforcement Learning for Predictive Control of Robotic Systems with Dense and Sparse Rewards

Efficient Exploration in Deep Reinforcement Learning: A Novel Bayesian Actor-Critic Algorithm

Discriminator-Actor-Critic: Addressing Sample Inefficiency and Reward Bias in Adversarial Imitation Learning

Toward a data efficient neural actor-critic

Model Predictive Actor-Critic: Accelerating Robot Skill Acquisition with Deep Reinforcement Learning

The Actor-Advisor: Policy Gradient With Off-Policy Advice

Actor-Critic Reinforcement Learning with Phased Actor

Efficiently Training On-Policy Actor-Critic Networks in Robotic Deep Reinforcement Learning with Demonstration-like Sampled Exploration

Revisiting Gaussian mixture critics in off-policy reinforcement learning: a sample-based approach

OPAC: Opportunistic Actor-Critic

Modified Actor-Critics

Quality-Diversity Actor-Critic: Learning High-Performing and Diverse Behaviors via Value and Successor Features Critics

Efficient Continuous Control with Double Actors and Regularized Critics

Recursive Least Squares Advantage Actor-Critic Algorithms

On the sample complexity of actor-critic method for reinforcement learning with function approximation

Value Improved Actor Critic Algorithms