Memory-Efficient Episodic Control Reinforcement Learning with Dynamic Online k-means

Andrea Agostinelli, Kai Arulkumaran, Marta Sarrico, Pierre Richemond, Anil Anthony Bharath
DOI: https://doi.org/10.48550/arXiv.1911.09560
2019-11-21
Abstract: Recently, neuro-inspired episodic control (EC) methods have been developed to overcome the data-inefficiency of standard deep reinforcement learning approaches. Using non-/semi-parametric models to estimate the value function, they learn rapidly, retrieving cached values from similar past states. In realistic scenarios, with limited resources and noisy data, maintaining meaningful representations in memory is essential to speed up the learning and avoid catastrophic forgetting. Unfortunately, EC methods have a large space and time complexity. We investigate different solutions to these problems based on prioritising and ranking stored states, as well as online clustering techniques. We also propose a new dynamic online k-means algorithm that is both computationally-efficient and yields significantly better performance at smaller memory sizes; we validate this approach on classic reinforcement learning environments and Atari games.
Machine Learning, Neural and Evolutionary Computing
What problem does this paper attempt to address?
The paper asks how to reduce the memory requirements of neuro-inspired episodic control (EC) methods in reinforcement learning while maintaining or improving learning performance. Specifically, it focuses on how to maintain meaningful representations in memory in realistic scenarios with limited resources and noisy data, so as to speed up learning and avoid catastrophic forgetting. The authors propose a new dynamic online k-means algorithm (DkM), which performs significantly better at smaller memory sizes, and validate it on classic reinforcement learning environments and Atari games.

### Main contributions:

1. **Evaluating different memory storage strategies**: The paper evaluates five memory storage strategies, including a new online clustering algorithm, and applies them to two EC algorithms, which are tested in classic RL environments and Atari games.
2. **Performance**: The study finds that replacing the least-frequently-used states or using online clustering techniques performs best across multiple settings and environments.
3. **Proposing a new online clustering algorithm**: The paper proposes a new dynamic online k-means algorithm (DkM), which outperforms the other memory storage strategies at smaller memory sizes.

### Background knowledge:

- **Reinforcement Learning (RL)**: An agent optimizes its behavior through interaction with an environment, with the goal of learning an optimal policy that maximizes the expected reward.
- **Episodic Control (EC)**: Inspired by the rapid instance-based learning of the hippocampus, EC methods use non-parametric or semi-parametric models to estimate the value function and learn quickly by retrieving cached values from similar past states.
- **Memory storage strategy**: With limited memory capacity, the research question is how to store states selectively so that the agent can learn from recently observed states while avoiding catastrophic forgetting.

### Experimental results:

- **Classic control tasks**: In the Cartpole and Acrobot tasks, DkM performs very well at small memory sizes and significantly outperforms the other methods.
- **Grid-world tasks**: In the OpenRoom and FourRoom tasks, DkM performs well at small memory sizes, but at large memory sizes the LRU method solves the tasks better.
- **Atari games**: In five Atari games, DkM performs best at small memory sizes, but at large memory sizes the LRU method performs better.

### Conclusion:

The dynamic online k-means algorithm (DkM) proposed in the paper improves the learning performance of EC methods at smaller memory sizes, making them more applicable in resource-constrained real-world scenarios. For larger memory sizes, however, the traditional LRU method remains an effective choice.
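
To make the idea of clustering-based memory storage concrete, the following is a minimal sketch of a fixed-capacity episodic memory that merges new observations into the nearest stored centroid once it is full. It uses a plain online k-means running-mean update and is not the paper's DkM algorithm, whose details are not given in this summary. The class name `ClusteredMemory`, its methods `write` and `read`, and the `capacity` and `k` parameters are all hypothetical names chosen for illustration.

```python
import math


class ClusteredMemory:
    """Fixed-capacity episodic memory (illustrative sketch, not DkM).

    Each slot holds a centroid key, a value estimate, and the number of
    observations that have been merged into that slot.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.keys = []    # centroid vectors, each a list of floats
        self.values = []  # value estimate per centroid
        self.counts = []  # observations merged into each centroid

    @staticmethod
    def _dist(a, b):
        # Euclidean distance between two equal-length vectors.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def _ranked(self, key):
        # Slot indices sorted from nearest to farthest centroid.
        return sorted(range(len(self.keys)),
                      key=lambda i: self._dist(self.keys[i], key))

    def write(self, key, value):
        # While there is room, store the observation as its own centroid.
        if len(self.keys) < self.capacity:
            self.keys.append([float(x) for x in key])
            self.values.append(float(value))
            self.counts.append(1)
            return
        # Memory full: merge into the nearest centroid with a running mean
        # (the standard online k-means update), so memory size stays fixed.
        i = self._ranked(key)[0]
        self.counts[i] += 1
        lr = 1.0 / self.counts[i]
        self.keys[i] = [c + lr * (x - c) for c, x in zip(self.keys[i], key)]
        self.values[i] += lr * (value - self.values[i])

    def read(self, key, k=1):
        # Value estimate: mean value of the k nearest stored centroids.
        if not self.keys:
            return 0.0
        nearest = self._ranked(key)[:k]
        return sum(self.values[i] for i in nearest) / len(nearest)


memory = ClusteredMemory(capacity=2)
memory.write([0.0, 0.0], 1.0)
memory.write([10.0, 10.0], 5.0)
memory.write([2.0, 0.0], 3.0)  # memory is full, so this merges into slot 0
print(memory.keys[0], memory.values[0])
```

In this sketch the third write does not evict anything: it moves the nearest centroid toward the new state and averages the stored value. A replacement-based strategy such as LRU would instead overwrite an existing slot, which is the trade-off the experimental results above compare.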