Abstract: The AlphaZero/MuZero (A/MZ) family of algorithms has achieved remarkable success across various challenging domains by integrating Monte Carlo Tree Search (MCTS) with learned models. Learned models introduce epistemic uncertainty, which is caused by learning from limited data and is useful for exploration in sparse-reward environments. MCTS, however, does not account for the propagation of this uncertainty. To address this, we introduce Epistemic MCTS (EMCTS): a theoretically motivated approach to account for the epistemic uncertainty in search and harness the search for deep exploration. In the challenging sparse-reward task of writing code in the assembly language SUBLEQ, AZ paired with our method achieves significantly higher sample efficiency than baseline AZ. Search with EMCTS solves variations of the commonly used hard-exploration benchmark Deep Sea, which baseline A/MZ are practically unable to solve, much faster than an otherwise equivalent method that does not use search for uncertainty estimation, demonstrating significant benefits from search for epistemic uncertainty estimation.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses how reinforcement learning (RL) algorithms can conduct deep exploration effectively in sparse-reward environments. Specifically, it extends the Monte Carlo Tree Search (MCTS) used in the AlphaZero/MuZero (A/MZ) family of algorithms and introduces Epistemic MCTS (EMCTS), which accounts for epistemic uncertainty in the learned model. Epistemic uncertainty is the uncertainty in model predictions caused by training on limited data, and it is an important signal for exploration in sparse-reward environments.

### Main contributions

1. **Combining MCTS and epistemic uncertainty**:
   - Propose a theoretically grounded method that uses MCTS and epistemic uncertainty for deep exploration based on upper confidence bounds.
   - Propagate epistemic uncertainty in the value and/or environment dynamics model through the search process.
2. **Parallel implementation**:
   - Implement EMCTS paired with AZ agents, together with an environment implementing the SUBLEQ assembly language, in JAX for parallel execution.
3. **Experimental verification**:
   - Evaluate EMCTS on the SUBLEQ programming task and on the commonly used hard-exploration benchmark Deep Sea. The results show that EMCTS is significantly more sample-efficient than the baseline method and explores more effectively in sparse-reward environments.

### Background

- **Reinforcement learning**: Agents learn a behavior policy through interaction with the environment, with the goal of maximizing the expected discounted return from the initial state distribution.
- **Monte Carlo Tree Search (MCTS)**: Builds a search tree through four steps (selection, expansion, simulation, and backpropagation) to estimate the objective function.
- **Uncertainty quantification in deep reinforcement learning**: A common approach is to quantify epistemic uncertainty through the variance of random variables.
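The variance-based view of epistemic uncertainty mentioned in the background can be illustrated with a minimal sketch. It assumes the uncertainty is read off the disagreement between several independently trained value predictors (an ensemble); the summary does not say which estimator the paper uses, so the function name `epistemic_variance` and the example numbers are purely hypothetical.

```python
import statistics


def epistemic_variance(ensemble_predictions):
    """Return (mean, variance) of value predictions made by an ensemble.

    Assumption: disagreement between ensemble members is used as a proxy
    for epistemic uncertainty. This is one common choice, not necessarily
    the estimator used in the paper.
    """
    mean = statistics.fmean(ensemble_predictions)
    variance = statistics.pvariance(ensemble_predictions, mu=mean)
    return mean, variance


# Hypothetical predictions from four ensemble members for two states.
well_visited = [0.50, 0.52, 0.49, 0.51]    # members agree: low uncertainty
rarely_visited = [0.10, 0.90, 0.40, 0.70]  # members disagree: high uncertainty

for name, preds in [("well_visited", well_visited), ("rarely_visited", rarely_visited)]:
    mean, variance = epistemic_variance(preds)
    print(f"{name}: mean={mean:.3f}, variance={variance:.6f}")
```

States seen rarely during training tend to produce larger disagreement, which is the signal an exploration bonus can build on.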
### Deep exploration and Epistemic MCTS

- **Epistemic uncertainty**: By estimating the epistemic uncertainty of the model, the agent can identify regions where its predictions are unreliable and direct exploration toward them.
- **Epistemic P/UCT**: Propose the Epistemic P/UCT (EP/UCT) selection strategies, including Epistemic PUCT (EPUCT), which target the maximum upper confidence bound (UCB) as the exploration objective.
- **Propagation of epistemic uncertainty**: Propagate the epistemic uncertainty of node-value predictions through the search tree so that uncertainty estimates remain accurate during search.

### Experimental results

- **SUBLEQ experiment**: On the SUBLEQ programming task, EMCTS finds correct programs with fewer samples, significantly improving sample efficiency.
- **Deep Sea benchmark**: On Deep Sea, EMCTS demonstrates deep exploration and solves tasks that baseline A/MZ cannot solve within a reasonable number of samples.

### Conclusion

By introducing Epistemic MCTS, the paper addresses deep exploration in sparse-reward environments, improves sample efficiency, and demonstrates effectiveness on multiple tasks. Beyond deep exploration, EMCTS can also be used for other purposes, such as reducing overestimation errors and weighting value and policy losses.
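The two ideas summarized under "Deep exploration and Epistemic MCTS" (an optimistic selection rule and the propagation of uncertainty through the tree) can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact formulation: the `Node` class, the additive bonus `beta * sigma`, the constants `c_puct` and `beta`, and the averaging of backed-up variances are hypothetical choices, and rewards along the path are omitted for brevity.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    prior: float                # policy prior for the action leading to this node
    visits: int = 0
    value_sum: float = 0.0      # sum of backed-up value estimates
    variance_sum: float = 0.0   # sum of backed-up epistemic variances
    children: dict = field(default_factory=dict)

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

    def sigma(self):
        # Epistemic standard deviation from the averaged variance (an assumption).
        return math.sqrt(self.variance_sum / self.visits) if self.visits else 0.0


def epistemic_puct_score(parent, child, c_puct=1.25, beta=1.0):
    """Standard PUCT score plus an optimism bonus of beta standard deviations."""
    exploration = c_puct * child.prior * math.sqrt(parent.visits) / (1 + child.visits)
    return child.q() + beta * child.sigma() + exploration


def select_action(parent, **kwargs):
    return max(
        parent.children,
        key=lambda a: epistemic_puct_score(parent, parent.children[a], **kwargs),
    )


def backup(path, leaf_value, leaf_variance, gamma=0.99):
    """Propagate a leaf value and its variance from the leaf back to the root."""
    value, variance = leaf_value, leaf_variance
    for node in reversed(path):
        node.visits += 1
        node.value_sum += value
        node.variance_sum += variance
        value *= gamma           # the value seen one level up is discounted
        variance *= gamma ** 2   # Var(gamma * X) = gamma^2 * Var(X)


# Hypothetical root with one well-understood child and one uncertain child.
root = Node(prior=1.0, visits=2)
root.children = {
    "known": Node(prior=0.5, visits=1, value_sum=0.2, variance_sum=0.0),
    "novel": Node(prior=0.5, visits=1, value_sum=0.1, variance_sum=0.25),
}
print(select_action(root, beta=0.0))  # without the uncertainty bonus
print(select_action(root, beta=1.0))  # with the uncertainty bonus
```

With `beta=0.0` the rule reduces to plain PUCT and prefers the child with the higher value estimate; with a positive `beta` the uncertain child can win despite its lower estimate, which is the optimism that drives exploration toward poorly understood states.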

Epistemic Monte Carlo Tree Search
