Approximate Exploration through State Abstraction

Adrien Ali Taïga, Aaron Courville, Marc G. Bellemare
DOI: https://doi.org/10.48550/arXiv.1808.09819
2019-01-25
Abstract: Although exploration in reinforcement learning is well understood from a theoretical point of view, provably correct methods remain impractical. In this paper we study the interplay between exploration and approximation, what we call approximate exploration. Our main goal is to further our theoretical understanding of pseudo-count based exploration bonuses (Bellemare et al., 2016), a practical exploration scheme based on density modelling. As a warm-up, we quantify the performance of an exploration algorithm, MBIE-EB (Strehl and Littman, 2008), when explicitly combined with state aggregation. This allows us to confirm that, as might be expected, approximation allows the agent to trade off between learning speed and quality of the learned policy. Next, we show how a given density model can be related to an abstraction and that the corresponding pseudo-count bonus can act as a substitute in MBIE-EB combined with this abstraction, but may lead to either under- or over-exploration. Then, we show that a given density model also defines an implicit abstraction, and find a surprising mismatch between pseudo-counts derived either implicitly or explicitly. Finally we derive a new pseudo-count bonus alleviating this issue.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper asks how a reinforcement learning agent can reduce its sample complexity (the number of interactions it needs) while still exploring efficiently. The authors start from the observation that, in large state spaces, exploration methods with theoretical guarantees are often impractical because their sample complexity is too high. They therefore study approximate exploration, and in particular exploration bonuses based on pseudo-counts, with the aim of finding algorithms that converge to reasonable policies more quickly in practice.

### Main problem decomposition:

1. **Gap between theory and practice**:
   - In theory, many exploration algorithms (such as MBIE-EB) provide finite-time performance guarantees, but their sample complexity usually grows linearly with the number of environment states, which makes them impractical in large environments.
   - In practice, pseudo-count methods based on density models (such as the scheme proposed by Bellemare et al.) perform well, but lack rigorous theoretical support.
2. **Challenges of approximate exploration**:
   - How to speed up exploration by introducing an abstraction, at the cost of settling for an approximately optimal policy.
   - Pseudo-count methods behave differently at different levels of abstraction, which can cause over-exploration or under-exploration and degrade the quality of the final policy.
3. **Relationship between pseudo-counts and abstraction**:
   - A pseudo-count is essentially an estimate of the actual number of visits, but under an abstraction this estimate can be distorted, which in turn affects exploration.
   - The authors examine how adjusting the pseudo-count formula can alleviate these problems and keep exploration effective.
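
The distortion described in the last item can be illustrated with a minimal sketch. It uses the pseudo-count definition of Bellemare et al. (2016), N̂ = ρ(1 − ρ′)/(ρ′ − ρ), where ρ is the density model's probability of a state before observing it once more and ρ′ the probability after, together with an MBIE-EB-style bonus β/√N. Everything else is an assumption made for illustration: the class `AggregatedEmpiricalDensity`, the aggregation map `phi`, the use of state-only rather than state-action counts, and the small `eps` that avoids division by zero are not taken from the paper.

```python
import math
from collections import Counter


def pseudo_count(rho, rho_next):
    """Pseudo-count N = rho * (1 - rho') / (rho' - rho) (Bellemare et al., 2016)."""
    if rho_next <= rho:
        # The definition needs a model whose probability of a state
        # increases after observing that state once more.
        raise ValueError("expected rho_next > rho")
    return rho * (1.0 - rho_next) / (rho_next - rho)


def exploration_bonus(count, beta=1.0, eps=0.01):
    """MBIE-EB-style bonus beta / sqrt(count); eps is an assumed stabiliser."""
    return beta / math.sqrt(count + eps)


class AggregatedEmpiricalDensity:
    """Hypothetical density model: empirical frequencies over abstract states."""

    def __init__(self, phi):
        self.phi = phi            # maps a ground state to its abstract state
        self.counts = Counter()   # visits per abstract state
        self.total = 0            # total number of observations

    def prob(self, state):
        """Probability assigned to `state` before observing it again."""
        if self.total == 0:
            return 0.0
        return self.counts[self.phi(state)] / self.total

    def prob_after_update(self, state):
        """Probability the model would assign after one more observation."""
        return (self.counts[self.phi(state)] + 1) / (self.total + 1)

    def update(self, state):
        self.counts[self.phi(state)] += 1
        self.total += 1
```

As a usage example, take `phi = lambda s: s // 2` and the visit sequence 0, 1, 1, 2. State 0 has been visited once, yet `pseudo_count(model.prob(0), model.prob_after_update(0))` is about 3, because states 0 and 1 share an abstract state and the model counts their visits together. The bonus for state 0 therefore shrinks faster than its true visit count warrants, a simple instance of the under-exploration risk noted above.
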
### Core objectives of the paper:

- **Theoretical understanding**: Gain an in-depth understanding of how pseudo-count methods perform in non-tabular settings, especially how they interact with state abstraction.
- **Method improvement**: Propose a new pseudo-count bonus that alleviates the under-exploration or over-exploration seen in existing methods.
- **Empirical verification**: Verify the effectiveness of the new method through experiments and analyze its performance in different environments.

### Key conclusions:

- Pseudo-count methods can significantly accelerate exploration in practice, but may lead to under-exploration or over-exploration in some cases.
- The proposed pseudo-count formula alleviates these problems to a certain extent, making exploration more efficient and stable.
- The experimental results show that the new pseudo-count method performs better in environments such as grid worlds, especially on complex tasks.

In conclusion, this paper combines theoretical analysis and empirical study to provide a more practical and effective approach to exploration in reinforcement learning, especially in large state spaces.