Zhao Yang, Thomas M. Moerland, Mike Preuss, Aske Plaat
Abstract: While deep reinforcement learning has shown important empirical success, it tends to learn relatively slowly due to slow propagation of reward information and slow updates of parametric neural networks. Non-parametric episodic memory, on the other hand, provides a faster learning alternative that does not require representation learning and uses maximum episodic return as state-action values for action selection. Episodic memory and reinforcement learning both have their own strengths and weaknesses. Notably, humans can leverage multiple memory systems concurrently during learning and benefit from all of them. In this work, we propose a method called Two-Memory reinforcement learning agent (2M) that combines episodic memory and reinforcement learning to distill both of their strengths. The 2M agent exploits the speed of the episodic memory part and the optimality and the generalization capacity of the reinforcement learning part to complement each other. Our experiments demonstrate that the 2M agent is more data efficient and outperforms both pure episodic memory and pure reinforcement learning, as well as a state-of-the-art memory-augmented RL agent. Moreover, the proposed approach provides a general framework that can be used to combine any episodic memory agent with other off-policy reinforcement learning algorithms.
What problem does this paper attempt to address?
This paper addresses the low data efficiency of deep reinforcement learning (DRL). Although DRL performs well in many domains, such as games and scientific applications, it learns slowly because reward information propagates slowly and parametric neural networks, including their learned representations, are updated slowly. Non-parametric episodic memory offers a faster alternative: it requires no representation learning and uses the maximum episodic return as the state-action value for action selection. However, episodic memory struggles on stochastic tasks and lacks a learnable feature representation, which makes generalization difficult. The paper therefore proposes Two-Memory reinforcement learning (2M), which combines episodic memory and reinforcement learning so that each compensates for the other's weaknesses, with the aim of improving learning efficiency and performance.
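To make the "maximum episodic return" idea concrete, here is a minimal tabular sketch. It is not the paper's implementation: the class name `EpisodicMemory`, the discount factor `gamma`, and the default value of 0 for unseen state-action pairs are all assumptions introduced for illustration.

```python
class EpisodicMemory:
    """Hypothetical tabular episodic memory.

    Stores, for each (state, action) pair, the highest discounted
    return that followed it in any episode seen so far.
    """

    def __init__(self, n_actions, gamma=0.99):
        self.n_actions = n_actions
        self.gamma = gamma          # assumed discount factor
        self.table = {}             # (state, action) -> max observed return

    def update(self, episode):
        """episode: list of (state, action, reward) tuples in time order."""
        g = 0.0
        # Walk backwards so g is the discounted return from each step onward.
        for state, action, reward in reversed(episode):
            g = reward + self.gamma * g
            key = (state, action)
            # Keep the maximum, never an average: no gradient steps needed.
            self.table[key] = max(self.table.get(key, float("-inf")), g)

    def value(self, state, action, default=0.0):
        """Return the stored value, or `default` for unseen pairs (an assumption)."""
        return self.table.get((state, action), default)

    def act(self, state):
        """Greedy action with respect to the stored maximum returns."""
        values = [self.value(state, a) for a in range(self.n_actions)]
        return max(range(self.n_actions), key=values.__getitem__)
```

Because the table keeps a maximum rather than an expectation, a single lucky return in a stochastic task is never averaged away, which is one way to see why episodic memory alone can struggle there.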
### Main contributions of the paper
1. **Proposing the 2M framework**: The framework combines episodic control (EC) with reinforcement learning (RL) to exploit the strengths of both. It can be paired with any EC method and any off-policy RL method.
2. **Experimental verification**: Experiments show that 2M outperforms pure episodic memory, pure reinforcement learning, and a state-of-the-art memory-augmented RL agent. Ablation studies examine how 2M behaves under different settings and point to possible improvements.
### Method overview
The 2M agent maintains two "memories":
- **Episodic control (2M-EC)**: Used for fast learning and action selection.
- **Reinforcement learning (2M-RL)**: Used for optimization and generalization.
Before each episode begins, the 2M agent decides which memory to use for action selection. The collected data is used to update 2M-EC and is also stored in the experience replay buffer for later updates of 2M-RL. As training progresses, 2M-RL becomes better at handling stochasticity and learns better feature representations, so the 2M agent gradually shifts from 2M-EC to 2M-RL.
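The per-episode control flow described above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the authors' code: the names `TwoMemoryAgent`, `ChainEnv`, `p_ec`, and `max_steps` are hypothetical, and choosing the memory by sampling with probability `p_ec` is an assumed mechanism (the summary only says the agent decides before each episode). The RL update itself is left out; the sketch only shows that every transition reaches the replay buffer regardless of which memory acted.

```python
import random
from collections import deque


class ChainEnv:
    """Hypothetical toy environment: action 1 moves right, state 2 is terminal."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        if action == 1:
            self.state += 1
        done = self.state == 2
        return self.state, (1.0 if done else 0.0), done


class TwoMemoryAgent:
    """Sketch of per-episode memory selection with shared data."""

    def __init__(self, ec_policy, rl_policy, ec_update, buffer_size=10_000, seed=0):
        self.ec_policy = ec_policy      # callable: state -> action (2M-EC)
        self.rl_policy = rl_policy      # callable: state -> action (2M-RL)
        self.ec_update = ec_update      # callable: episode -> None
        self.replay_buffer = deque(maxlen=buffer_size)
        self.rng = random.Random(seed)

    def select_memory(self, p_ec):
        # Assumption: sample the acting memory with probability p_ec for EC.
        return "EC" if self.rng.random() < p_ec else "RL"

    def run_episode(self, env, p_ec, max_steps=100):
        # The choice is made once, before the episode starts.
        memory = self.select_memory(p_ec)
        policy = self.ec_policy if memory == "EC" else self.rl_policy
        state, episode = env.reset(), []
        for _ in range(max_steps):
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            # Shared data: every transition is stored for later 2M-RL updates.
            self.replay_buffer.append((state, action, reward, next_state, done))
            state = next_state
            if done:
                break
        # The same episode also updates the episodic memory.
        self.ec_update(episode)
        return memory, episode
```

In a fuller version, `ec_policy` and `ec_update` would come from an episodic memory and `rl_policy` from an off-policy learner that samples minibatches from `replay_buffer`.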
### Experimental results
1. **WindyGrid environment**: The 2M agent learns faster than pure RL early in training, and its final performance exceeds that of pure EC.
2. **MinAtar games**: The 2M agent performs well on five MinAtar games, both early in training and asymptotically. The experiments also show that data sharing lets 2M-RL and 2M-EC complement each other, further improving overall performance.
### Ablation studies
1. **Impact of data sharing**: Sharing data helps 2M-RL and 2M-EC learn from each other, with particularly clear gains in some games (such as Asterix).
2. **Scheduling mechanisms**: Different schedules (decaying, constant, increasing) affect the 2M agent's performance differently, and the decaying schedule usually performs best.
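As a rough illustration of the three schedule types, the sketch below assumes that the scheduled quantity is the probability of choosing 2M-EC for an episode and that the decaying and increasing schedules are linear. Both are assumptions; the function name `ec_probability` and its parameters are hypothetical and the paper's exact schedules may differ.

```python
def ec_probability(step, total_steps, schedule="decay",
                   p_start=1.0, p_end=0.0, p_const=0.5):
    """Hypothetical probability of using episodic control at a given step.

    schedule:
      "decay"    - linear from p_start down to p_end
      "increase" - linear from p_end up to p_start
      "constant" - fixed at p_const
    """
    # Fraction of training completed, clipped to [0, 1].
    frac = min(max(step / total_steps, 0.0), 1.0)
    if schedule == "decay":
        return p_start + (p_end - p_start) * frac
    if schedule == "increase":
        return p_end + (p_start - p_end) * frac
    if schedule == "constant":
        return p_const
    raise ValueError(f"unknown schedule: {schedule!r}")
```

Under these assumptions, a decaying schedule relies on the fast episodic memory early and hands control to the RL memory later, which matches the gradual shift from 2M-EC to 2M-RL described in the method overview.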
In conclusion, the paper proposes the 2M framework, which combines episodic memory and reinforcement learning to improve learning efficiency and performance, and the experimental results support its effectiveness.