Epistemic Monte Carlo Tree Search

Yaniv Oren, Villiam Vadocz, Matthijs T. J. Spaan, Wendelin Böhmer
2024-10-04
Abstract: The AlphaZero/MuZero (A/MZ) family of algorithms has achieved remarkable success across various challenging domains by integrating Monte Carlo Tree Search (MCTS) with learned models. Learned models introduce epistemic uncertainty, which is caused by learning from limited data and is useful for exploration in sparse reward environments. MCTS does not account for the propagation of this uncertainty, however. To address this, we introduce Epistemic MCTS (EMCTS): a theoretically motivated approach to account for the epistemic uncertainty in search and harness the search for deep exploration. In the challenging sparse-reward task of writing code in the Assembly language SUBLEQ, AZ paired with our method achieves significantly higher sample efficiency over baseline AZ. Search with EMCTS solves variations of the commonly used hard-exploration benchmark Deep Sea - which baseline A/MZ are practically unable to solve - much faster than an otherwise equivalent method that does not use search for uncertainty estimation, demonstrating significant benefits from search for epistemic uncertainty estimation.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses how reinforcement learning (RL) algorithms can conduct deep exploration effectively in sparse-reward environments. Specifically, it modifies the Monte Carlo Tree Search (MCTS) used in the AlphaZero/MuZero (A/MZ) family of algorithms and introduces Epistemic MCTS (EMCTS), which accounts for epistemic uncertainty in the learned model. Epistemic uncertainty is the uncertainty in model predictions caused by training on limited data, and it is a useful signal for exploration in sparse-reward environments.

### Main contributions

1. **Combining MCTS and epistemic uncertainty**:
   - Propose a theoretically grounded method that uses MCTS and epistemic uncertainty for deep exploration based on upper confidence bounds.
   - Propagate epistemic uncertainty in the value and/or environment dynamics model through the search process.
2. **Parallel implementation**:
   - Implement EMCTS paired with AZ agents, along with an environment for the SUBLEQ language, in JAX so that they can run in parallel.
3. **Experimental verification**:
   - Evaluate EMCTS on the SUBLEQ programming task and on the commonly used hard-exploration benchmark Deep Sea. The results show that EMCTS is significantly more sample efficient than the baseline and explores more effectively in sparse-reward environments.

### Background

- **Reinforcement learning**: Agents learn a behavior policy through interaction with the environment, with the goal of maximizing the expected discounted return from the initial state distribution.
- **Monte Carlo Tree Search (MCTS)**: Builds a search tree through four repeated steps (selection, expansion, simulation, and backpropagation) to estimate the objective function.
- **Uncertainty quantification in deep reinforcement learning**: A common approach quantifies epistemic uncertainty as the variance of a random variable.
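
The variance-based view of epistemic uncertainty mentioned in the Background can be illustrated with a small sketch. It assumes the uncertainty comes from disagreement within an ensemble of value predictors; the ensemble and the function name `ensemble_value_and_uncertainty` are illustrative assumptions, not the estimator used in the paper.

```python
from statistics import fmean, pvariance


def ensemble_value_and_uncertainty(predictions):
    """Summarise an ensemble's value predictions for a single state.

    Assumption (not from the paper): each entry in `predictions` is the
    value estimate of one independently trained ensemble member.
    Returns (mean value, variance across members); the variance serves
    as a simple proxy for epistemic uncertainty.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    return fmean(predictions), pvariance(predictions)


# Members agree on a well-visited state and disagree on a novel one.
familiar = ensemble_value_and_uncertainty([1.0, 1.0, 1.0])
novel = ensemble_value_and_uncertainty([0.0, 2.0])
print(familiar, novel)
```

Both states have the same mean prediction here, but only the second has non-zero variance, which is the signal an exploring agent would act on.
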
### Deep exploration and Epistemic MCTS

- **Epistemic uncertainty**: By estimating the epistemic uncertainty of the model, the agent can identify regions where the model is uncertain and direct exploration toward them.
- **Epistemic P/UCT**: Propose the Epistemic P/UCT (EP/UCT) selection strategies, including Epistemic PUCT (EPUCT), so that the search tracks a maximum upper-confidence-bound (UCB) exploration target.
- **Propagation of epistemic uncertainty**: Propagate the epistemic uncertainty of node-value predictions through the tree so that uncertainty estimates stay accurate during search.

### Experimental results

- **SUBLEQ experiment**: On the SUBLEQ programming task, EMCTS finds correct programs with fewer samples, significantly improving sample efficiency.
- **Deep Sea benchmark**: On Deep Sea, EMCTS demonstrates deep exploration and solves tasks that baseline A/MZ cannot solve within a reasonable number of samples.

### Conclusion

By introducing Epistemic MCTS, the paper addresses deep exploration in sparse-reward environments, improves sample efficiency, and demonstrates the method on multiple tasks. Beyond deep exploration, EMCTS can also be used for other purposes, such as reducing overestimation errors and weighting the value and policy losses.
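
The selection and propagation steps described under "Deep exploration and Epistemic MCTS" can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the `Node` class, the `beta` and `c_puct` parameters, the exact form of the score, and the backup rule (reward variance plus the child's variance discounted by `gamma ** 2`, which treats the uncertainties as independent) are all hypothetical choices made for this example.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    prior: float = 1.0
    visits: int = 0
    value_sum: float = 0.0     # sum of backed-up returns
    variance_sum: float = 0.0  # sum of backed-up epistemic variances
    children: dict = field(default_factory=dict)

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

    def epistemic_std(self):
        return math.sqrt(self.variance_sum / self.visits) if self.visits else 0.0


def select_action(node, beta=1.0, c_puct=1.0):
    """Pick the child with the highest uncertainty-augmented PUCT score."""
    def score(child):
        explore = c_puct * child.prior * math.sqrt(node.visits) / (1 + child.visits)
        # Assumed UCB form: value + beta * epistemic std + PUCT visit bonus.
        return child.q() + beta * child.epistemic_std() + explore
    return max(node.children, key=lambda a: score(node.children[a]))


def backup(path, rewards, reward_vars, leaf_value, leaf_var, gamma=0.99):
    """Propagate value and epistemic variance from the leaf back to the root.

    `path` lists nodes from root to leaf; `rewards[i]` and `reward_vars[i]`
    belong to the transition from path[i] to path[i + 1].
    """
    value, var = leaf_value, leaf_var
    for i in reversed(range(len(path))):
        node = path[i]
        node.visits += 1
        node.value_sum += value
        node.variance_sum += var
        if i > 0:
            value = rewards[i - 1] + gamma * value
            # Assumption: independent uncertainties, so variances add.
            var = reward_vars[i - 1] + gamma ** 2 * var


# Two children with equal value estimates; only child 1 is uncertain.
root = Node(children={0: Node(prior=0.5), 1: Node(prior=0.5)})
backup([root, root.children[0]], [0.0], [0.0], leaf_value=0.0, leaf_var=0.0, gamma=1.0)
backup([root, root.children[1]], [0.0], [0.0], leaf_value=0.0, leaf_var=4.0, gamma=1.0)
print(select_action(root))
```

With `beta=0` the two children tie and the score reduces to an ordinary PUCT rule; a positive `beta` steers the search toward the child whose subtree carries more propagated uncertainty.
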