Anytime Sequential Halving in Monte-Carlo Tree Search

Dominic Sagers, Mark H.M. Winands, Dennis J.N.J. Soemers
2024-11-12
Abstract: Monte-Carlo Tree Search (MCTS) typically uses multi-armed bandit (MAB) strategies designed to minimize cumulative regret, such as UCB1, as its selection strategy. However, in the root node of the search tree, it is more sensible to minimize simple regret. Previous work has proposed using Sequential Halving as selection strategy in the root node, as, in theory, it performs better with respect to simple regret. However, Sequential Halving requires a budget of iterations to be predetermined, which is often impractical. This paper proposes an anytime version of the algorithm, which can be halted at any arbitrary time and still return a satisfactory result, while being designed such that it approximates the behavior of Sequential Halving. Empirical results in synthetic MAB problems and ten different board games demonstrate that the algorithm's performance is competitive with Sequential Halving and UCB1 (and their analogues in MCTS).
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper asks how to design a selection strategy for Monte-Carlo Tree Search (MCTS) that keeps the strengths of Sequential Halving (SH) while being "anytime", that is, able to stop at any moment and still return a usable decision.

### Problem Background

MCTS is a search algorithm for sequential decision-making problems and is widely used in games, planning, optimization, and control. One of its four key steps is selection, which usually relies on a multi-armed bandit (MAB) algorithm to balance exploration and exploitation. Common MAB algorithms such as UCB1 are designed to minimize cumulative regret, whereas Sequential Halving targets simple regret, which is the more sensible objective in the root node of the search tree. However, Sequential Halving requires the iteration budget to be fixed in advance, which is often impractical. Examples include large and diverse sets of games, automatically generated games, and agents with intelligent time management: in these settings a suitable budget is hard to fix beforehand, so the lack of an anytime property becomes a practical obstacle.

### Paper Solution

To address this, the paper proposes a new algorithm: Anytime Sequential Halving (Anytime SH). It can be halted at any point and still return a satisfactory result, while approximating the behavior of Sequential Halving. Specifically:

- **Anytime Termination Feature**: Anytime SH can be terminated at any point, and the quality of its final decision gradually improves as more processing time is available.
- **Behavior Approximation**: Anytime SH is inspired by standard Sequential Halving, adjusted so that it no longer needs a predetermined budget.
- **Experimental Verification**: In experiments on synthetic MAB problems and ten different board games, Anytime SH performs comparably to UCB1 and Sequential Halving while retaining the anytime property.
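
For context, the fixed-budget Sequential Halving that the paper builds on can be sketched roughly as follows. This is a minimal illustration on a synthetic Bernoulli bandit, not the authors' implementation: the function names, the example arm means, and the choice to accumulate statistics across rounds are all assumptions made here. It shows why the budget has to be known up front, since the number of pulls per arm in each round is derived from it.

```python
import math
import random


def sequential_halving(arm_means, budget, rng):
    """Fixed-budget Sequential Halving on Bernoulli arms (illustrative sketch).

    arm_means: true success probability of each arm (hypothetical test data).
    budget:    total number of pulls, which must be chosen in advance.
    Returns the index of the single surviving arm.
    """
    n = len(arm_means)
    num_rounds = max(1, math.ceil(math.log2(n)))
    survivors = list(range(n))
    totals = [0.0] * n  # summed rewards per arm
    pulls = [0] * n     # pull counts per arm

    for _ in range(num_rounds):
        # The per-arm allocation is computed from the total budget.
        # (For very small budgets the max(1, ...) may overshoot the budget.)
        pulls_per_arm = max(1, budget // (len(survivors) * num_rounds))
        for arm in survivors:
            for _ in range(pulls_per_arm):
                totals[arm] += 1.0 if rng.random() < arm_means[arm] else 0.0
                pulls[arm] += 1
        # Keep the better half (rounded up), ranked by empirical mean.
        # Assumption: statistics are accumulated over all rounds so far.
        survivors.sort(key=lambda a: totals[a] / pulls[a], reverse=True)
        survivors = survivors[: math.ceil(len(survivors) / 2)]

    return survivors[0]


def simple_regret(arm_means, chosen):
    """Gap between the best arm's true mean and the chosen arm's true mean."""
    return max(arm_means) - arm_means[chosen]


rng = random.Random(0)
arm_means = [0.1, 0.2, 0.9, 0.3]
chosen = sequential_halving(arm_means, budget=4000, rng=rng)
print(chosen, simple_regret(arm_means, chosen))
```

Simple regret only scores the final recommendation, which is why spending pulls on clearly worse arms early on is acceptable here, whereas a cumulative-regret strategy such as UCB1 is penalized for every suboptimal pull.
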
### Summary

The paper's core contribution is a root-node selection strategy for MCTS that targets simple regret while still allowing flexible time management. In experiments on synthetic MAB problems and ten board games, the proposed Anytime SH algorithm performs competitively with Sequential Halving and UCB1.
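
One way to picture an anytime halving loop is sketched below. It is an assumption-laden illustration, not necessarily the procedure defined in the paper. The idea shown is to pull every candidate arm once, keep the better half, repeat until one arm is left, and then start over with all arms, so that no budget is needed in advance. The names `anytime_halving_sketch`, `pull`, and `should_stop` are hypothetical, and recommending the arm with the highest empirical mean at halt time is also a choice made here.

```python
import math
import random


def anytime_halving_sketch(pull, n_arms, should_stop):
    """Halving-style bandit loop that can be halted at any time (sketch).

    pull(arm):     hypothetical callback returning a reward for that arm.
    should_stop(): hypothetical callback; checked before every single pull.
    Returns the arm with the highest empirical mean when halted.
    """
    totals = [0.0] * n_arms
    pulls = [0] * n_arms

    def mean(arm):
        # Arms that were never pulled rank last.
        return totals[arm] / pulls[arm] if pulls[arm] else float("-inf")

    candidates = list(range(n_arms))
    while True:
        for arm in candidates:
            if should_stop():
                return max(range(n_arms), key=mean)
            totals[arm] += pull(arm)
            pulls[arm] += 1
        if len(candidates) > 1:
            # Halve: keep the better half (rounded up) by empirical mean.
            candidates = sorted(candidates, key=mean, reverse=True)
            candidates = candidates[: math.ceil(len(candidates) / 2)]
        else:
            # Only one arm left: start a new pass over all arms.
            candidates = list(range(n_arms))


rng = random.Random(0)
arm_means = [0.1, 0.2, 0.9, 0.3]  # hypothetical Bernoulli arms
used = {"pulls": 0}


def pull(arm):
    used["pulls"] += 1
    return 1.0 if rng.random() < arm_means[arm] else 0.0


# The stopping rule is external: here a pull count, but it could be a clock.
best = anytime_halving_sketch(pull, len(arm_means), lambda: used["pulls"] >= 2000)
print(best, used["pulls"])
```

Because the stopping rule is passed in from outside, the same loop works with a wall-clock deadline or with an agent's own time-management logic, which is the flexibility the summary refers to.
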