DQN with model-based exploration: efficient learning on environments with sparse rewards

Stephen Zhen Gou, Yuyang Liu
DOI: https://doi.org/10.48550/arXiv.1903.09295
2019-03-22
Abstract: We propose Deep Q-Networks (DQN) with model-based exploration, an algorithm combining both model-free and model-based approaches that explores better and learns environments with sparse rewards more efficiently. DQN is a general-purpose, model-free algorithm and has been shown to perform well in a variety of tasks, including Atari 2600 games, since it was first proposed by Mnih et al. However, like many other reinforcement learning (RL) algorithms, DQN suffers from poor sample efficiency when rewards are sparse in an environment. As a result, most of the transitions stored in the replay memory have no informative reward signal and provide limited value to the convergence and training of the Q-Network. However, one insight is that these transitions can be used to learn the dynamics of the environment as a supervised learning problem. The transitions also provide information about the distribution of visited states. Our algorithm uses these two observations to perform one-step planning during exploration, picking the action that leads to the states least likely to have been seen, thus improving the performance of exploration. We demonstrate our agent's performance in two classic environments with sparse rewards in OpenAI gym: Mountain Car and Lunar Lander.
Machine Learning
What problem does this paper attempt to address?
This paper addresses how to improve the exploration efficiency and learning speed of reinforcement learning algorithms, DQN in particular, in environments with sparse rewards. In many practical applications the environment emits reward signals only rarely, which makes it hard for standard DQN and other reinforcement learning algorithms to learn effectively: most of the transitions stored in the experience replay memory carry no useful reward information and so contribute little to the training and convergence of the Q-network.

To overcome this challenge, the authors propose a modified DQN algorithm that combines model-based and model-free methods. The algorithm uses a learned model of the environment dynamics to guide exploration: by choosing the action whose predicted next state is least likely to have been encountered, it raises the probability of reaching new states. The aim is to explore the environment more efficiently and to accelerate learning in sparse-reward environments.

The two main contributions mentioned in the paper are:

1. **Combining model prediction for exploration**: A Dynamics Network predicts the next state given the current state and action. Its prediction is combined with a model of the distribution of recently visited states to choose actions that lead to less-visited or unseen states, which makes exploration more efficient.
2. **Better performance in sparse-reward environments**: Experiments were run in two classic sparse-reward environments, Mountain Car and Lunar Lander. In Mountain Car, the proposed algorithm significantly outperforms the original DQN and other baseline methods in exploration and learning speed.

Together, these changes offer a new approach to reinforcement learning in sparse-reward environments.
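The exploration step described above (learn the dynamics from stored transitions, model the distribution of visited states, then pick the action whose predicted next state is least likely) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a linear least-squares model stands in for the Dynamics Network, a single multivariate Gaussian is assumed as the density model over recent states, and all function names and the toy dynamics are hypothetical.

```python
import numpy as np

def fit_linear_dynamics(states, actions, next_states, n_actions):
    """Supervised fit of next_state ~ [state, one_hot(action)] @ W.
    Assumption: a linear model replaces the paper's Dynamics Network."""
    X = np.hstack([states, np.eye(n_actions)[actions]])
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    return W

def predict_next_states(W, state, n_actions):
    """Predicted next state for every action, shape (n_actions, state_dim)."""
    X = np.hstack([np.tile(state, (n_actions, 1)), np.eye(n_actions)])
    return X @ W

def fit_gaussian(recent_states, eps=1e-6):
    """Assumed density model: one Gaussian over recently visited states."""
    mean = recent_states.mean(axis=0)
    cov = np.atleast_2d(np.cov(recent_states, rowvar=False))
    return mean, cov + eps * np.eye(mean.size)  # eps keeps cov invertible

def log_density(points, mean, cov):
    """Gaussian log-density of each row of `points`."""
    d = points - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ij,ij->i", d @ np.linalg.inv(cov), d)
    return -0.5 * (maha + logdet + mean.size * np.log(2 * np.pi))

def explore_action(W, state, mean, cov, n_actions):
    """One-step planning: choose the action whose predicted next state
    has the lowest density under the visited-state model."""
    preds = predict_next_states(W, state, n_actions)
    return int(np.argmin(log_density(preds, mean, cov)))

# Toy usage with made-up dynamics: each action shifts the first state
# coordinate by -1, 0, or +1 (hypothetical, not Mountain Car physics).
rng = np.random.default_rng(0)
effects = np.array([[-1.0, 0.0], [0.0, 0.0], [1.0, 0.0]])
states = rng.normal(size=(200, 2))
actions = rng.integers(0, 3, size=200)
next_states = states + effects[actions]

W = fit_linear_dynamics(states, actions, next_states, n_actions=3)

# Pretend the agent has recently lingered around x = -2.
recent = rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(100, 2))
mean, cov = fit_gaussian(recent)

state = np.array([-1.5, 0.0])
action = explore_action(W, state, mean, cov, n_actions=3)
print("exploratory action:", action)
```

In this toy setup the chosen action is the one that moves the agent furthest from the region it has recently visited. In a full agent, a rule like this would replace only the random action taken during exploration, while greedy steps would still follow the Q-network.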