Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics

Denis Steckelmacher, Hélène Plisnier, Diederik M. Roijers, Ann Nowé
DOI: https://doi.org/10.48550/arXiv.1903.04193
2019-06-12
Abstract: Value-based reinforcement-learning algorithms provide state-of-the-art results in model-free discrete-action settings, and tend to outperform actor-critic algorithms. We argue that actor-critic algorithms are limited by their need for an on-policy critic. We propose Bootstrapped Dual Policy Iteration (BDPI), a novel model-free reinforcement-learning algorithm for continuous states and discrete actions, with an actor and several off-policy critics. Off-policy critics are compatible with experience replay, ensuring high sample-efficiency, without the need for off-policy corrections. The actor, by slowly imitating the average greedy policy of the critics, leads to high-quality and state-specific exploration, which we compare to Thompson sampling. Because the actor and critics are fully decoupled, BDPI is remarkably stable, and unusually robust to its hyper-parameters. BDPI is significantly more sample-efficient than Bootstrapped DQN, PPO, and ACKTR, on discrete, continuous and pixel-based tasks. Source code: https://github.com/vub-ai-lab/bdpi
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the limitations of current actor-critic algorithms in reinforcement learning, in particular their weak exploration and poor sample efficiency. Specifically:

1. **Limitations of Actor-Critic Algorithms**: Most existing actor-critic algorithms rely on an on-policy critic to evaluate the actor's policy. This dependence prevents them from being combined with sample-efficient off-policy value-based methods (such as the Q-Learning family), which hurts both sample efficiency and exploration.
2. **Exploration Efficiency**: In discrete-action settings, off-policy value-based methods learn efficiently from samples, but how to bring that efficiency into the actor-critic framework to improve exploration remains an open question.
3. **Sample Efficiency**: Existing actor-critic algorithms often need a large number of samples to converge on tasks with high-dimensional state spaces and sparse rewards, which is impractical in many real applications.

To address these problems, the paper proposes a new model-free reinforcement-learning algorithm, Bootstrapped Dual Policy Iteration (BDPI). BDPI combines one actor with several off-policy critics to improve sample efficiency and exploration. Its main contributions are:

- **Off-Policy Critics**: BDPI uses several off-policy critics, which are compatible with experience replay and therefore sample-efficient, without requiring off-policy corrections.
- **Actor Learning Rule**: The actor is updated gradually by imitating the average greedy policy of the critics. This learning rule is similar to Conservative Policy Iteration, but uses information from off-policy critics.
- **Exploration Ability**: Because the actor's learning rule combines information from several off-policy critics, it produces high-quality, state-specific exploration, which the authors compare to Thompson sampling.

Experimental results show that BDPI is significantly more sample-efficient than Bootstrapped DQN, PPO, and ACKTR on discrete, continuous, and pixel-based tasks, and the summary highlights its advantage on tasks where exploration is difficult. BDPI is also robust to the choice of hyper-parameters, which matters in practical applications.
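
The interplay between replay-trained critics and a slowly moving actor can be illustrated with a small tabular sketch. It is not the authors' implementation (see the linked repository for that): it assumes plain Q-learning as a stand-in critic rule, a hypothetical five-state chain environment (`step`), randomly initialised critics, and illustrative values for `GAMMA`, `CRITIC_LR`, and `ACTOR_LR`. Only the overall shape, several off-policy critics updated from replay and an actor mixed toward their average greedy policy, follows the description above.

```python
import numpy as np

N_STATES, N_ACTIONS, N_CRITICS = 5, 2, 4
# Illustrative values chosen for this sketch, not taken from the paper.
GAMMA, CRITIC_LR, ACTOR_LR = 0.9, 0.5, 0.05


def step(s, a):
    """Hypothetical chain: action 1 moves right, action 0 moves left.
    Reaching the last state yields reward 1 and ends the episode."""
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = s_next == N_STATES - 1
    return s_next, float(done), done


def greedy_policy(q):
    """One-hot greedy policy for every state of a (states, actions) Q-table."""
    policy = np.zeros_like(q)
    policy[np.arange(q.shape[0]), q.argmax(axis=1)] = 1.0
    return policy


def critic_update(q, batch):
    """Off-policy Q-learning on replayed transitions (stand-in critic rule).
    No importance weights are needed: the target uses max over actions."""
    for s, a, r, s_next, done in batch:
        target = r + (0.0 if done else GAMMA * q[s_next].max())
        q[s, a] += CRITIC_LR * (target - q[s, a])


def actor_update(actor, critics):
    """Move the actor a small step toward the critics' average greedy policy."""
    avg_greedy = np.mean([greedy_policy(q) for q in critics], axis=0)
    return (1.0 - ACTOR_LR) * actor + ACTOR_LR * avg_greedy


def train(n_steps=2000, batch_size=16, seed=0):
    rng = np.random.default_rng(seed)
    actor = np.full((N_STATES, N_ACTIONS), 1.0 / N_ACTIONS)
    # Different random initialisations give the critics some diversity
    # (an assumption of this sketch).
    critics = [rng.random((N_STATES, N_ACTIONS)) for _ in range(N_CRITICS)]
    replay, s = [], 0
    for _ in range(n_steps):
        a = int(rng.choice(N_ACTIONS, p=actor[s]))  # the actor explores
        s_next, r, done = step(s, a)
        replay.append((s, a, r, s_next, done))
        s = 0 if done else s_next
        for q in critics:  # each critic samples its own replay batch
            idx = rng.integers(len(replay), size=batch_size)
            critic_update(q, [replay[i] for i in idx])
        actor = actor_update(actor, critics)
    return actor, critics


if __name__ == "__main__":
    actor, _ = train()
    print(np.round(actor, 3))  # per-state action probabilities
```

Because the actor is a convex mixture of probability distributions, each of its rows remains a valid distribution after every update. In states where the critics disagree, the average greedy policy is itself stochastic, which is one way to read the comparison with Thompson sampling.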