Abstract: The upper confidence bound (UCB) policy is recognized as an order-optimal solution for the classical total-reward bandit problem. Similar UCB-based approaches have been applied to the max bandit problem, which aims to maximize the cumulative maximal reward, but their order optimality remains unclear. In this study, we clarify unified conditions under which the UCB policy achieves order optimality in both the total-reward and max bandit problems. A key concept of our theory is the oracle quantity, a quantity whose highest value identifies the best arm. This allows a unified definition of the UCB policy: pull the arm with the highest UCB of the oracle quantity. Under this setting, optimality analysis can be conducted with the number of failures, rather than the traditional regret, as the core measure. One consequence of our analysis is that the confidence interval of the oracle quantity must narrow appropriately as the number of trials increases for a UCB policy to be order-optimal. Using this result, we prove that the previously proposed MaxSearch algorithm satisfies this condition and is therefore an order-optimal policy for the max bandit problem. We also demonstrate that new bandit problems and their order-optimal UCB algorithms can be derived systematically by specifying an appropriate oracle quantity and its confidence interval. Building on this, we propose PIUCB algorithms, which aim to pull the arm with the highest probability of improvement (PI). These algorithms can be applied to the max bandit problem in practice and perform comparably to or better than the MaxSearch algorithm in toy examples. This suggests that our theory has the potential to generate new policies tailored to specific oracle quantities.
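
The unified UCB rule described above can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes the classical total-reward case, where the oracle quantity is the arm's mean reward, and uses the standard UCB1 confidence radius sqrt(2 ln t / n). The names `ucb_policy`, `mean_ucb`, and `bernoulli` are hypothetical, and "failures" is taken here to mean pulls of any arm other than the best one, which is an assumption about the paper's definition.

```python
import math
import random


def ucb_policy(arms, horizon, ucb_index, seed=0):
    """Pull the arm with the highest UCB of the oracle quantity each round.

    arms: list of callables, each taking a random.Random and returning a reward.
    ucb_index: hypothetical hook (rewards_of_arm, t) -> upper confidence bound.
    Returns the list of pulled arm indices.
    """
    rng = random.Random(seed)
    history = [[] for _ in arms]  # observed rewards per arm
    pulls = []
    for t in range(1, horizon + 1):
        if t <= len(arms):
            k = t - 1  # initialisation: pull every arm once
        else:
            k = max(range(len(arms)), key=lambda i: ucb_index(history[i], t))
        history[k].append(arms[k](rng))
        pulls.append(k)
    return pulls


def mean_ucb(rewards, t):
    """UCB1 index: sample mean plus a radius that narrows as pulls increase."""
    n = len(rewards)
    return sum(rewards) / n + math.sqrt(2.0 * math.log(t) / n)


def bernoulli(p):
    """Arm paying 1.0 with probability p, else 0.0 (toy assumption)."""
    return lambda rng: 1.0 if rng.random() < p else 0.0


arms = [bernoulli(0.2), bernoulli(0.5), bernoulli(0.8)]
pulls = ucb_policy(arms, horizon=5000, ucb_index=mean_ucb)
best = 2  # index of the arm with the highest oracle quantity (mean 0.8)
failures = sum(1 for k in pulls if k != best)
print("failures:", failures, "of", len(pulls))
```

Swapping `mean_ucb` for an index built on a different oracle quantity, with a confidence interval that narrows suitably, would give a different UCB policy within the same loop. That is the kind of substitution the abstract describes, though the specific MaxSearch and PIUCB indices are not reproduced here.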

Analysis of UCT Algorithm Policies in Imperfect Information Game

A Modified UCT Algorithm Based on Risk Estimation Methods

Doing Better Than UCT: Rational Monte Carlo Sampling in Trees

Modified UCT algorithm with risk dominance methods in imperfect information game

Uct Based Search In Phantom Go

UCT Algorithm in Amazons Human-Computer Games

Backpropagation Modification in Monte-Carlo Game Tree Search

Towards Understanding the Effects of Evolving the MCTS UCT Selection Policy

Pruning in UCT Algorithm

Extreme Value Monte Carlo Tree Search

Exploration Analysis in Finite-Horizon Turn-based Stochastic Games

Unified theory of upper confidence bound policies for bandit problems targeting total reward, maximal reward, and more

Modification of Uct Algorithm with Quiescent Search in Computer Go

An Optimal Computing Budget Allocation Tree Policy for Monte Carlo Tree Search

Combination of Auction Theory and Multi-Armed Bandits: Model, Algorithm, and Application

Imperfect and Cooperative Guandan Game System

An Analysis on the Effects of Evolving the Monte Carlo Tree Search Upper Confidence for Trees Selection Policy on Unimodal, Multimodal and Deceptive Landscapes

Monte Carlo Tree Search with Boltzmann Exploration

The Extended UCB Policies for Frequentist Multi-armed Bandit Problems

Fittest Survival: an Enhancement Mechanism for Monte Carlo Tree Search

A Modification of UCT Algorithm for WTN-EinStein Würfelt Nicht! Game