Abstract: AlphaZero is a self-play reinforcement learning algorithm that achieves superhuman play in chess, shogi, and Go via policy iteration. To be an effective policy improvement operator, AlphaZero's search requires accurate value estimates for the states appearing in its search tree. AlphaZero trains upon self-play matches beginning from the initial state of a game and only samples actions over the first few moves, limiting its exploration of states deeper in the game tree. We introduce Go-Exploit, a novel search control strategy for AlphaZero. Go-Exploit samples the start state of its self-play trajectories from an archive of states of interest. Beginning self-play trajectories from varied starting states enables Go-Exploit to more effectively explore the game tree and to learn a value function that generalizes better. Producing shorter self-play trajectories allows Go-Exploit to train upon more independent value targets, improving value training. Finally, the exploration inherent in Go-Exploit reduces its need for exploratory actions, enabling it to train under more exploitative policies. In the games of Connect Four and 9x9 Go, we show that Go-Exploit learns with a greater sample efficiency than standard AlphaZero, resulting in stronger performance against reference opponents and in head-to-head play. We also compare Go-Exploit to KataGo, a more sample efficient reimplementation of AlphaZero, and demonstrate that Go-Exploit has a more effective search control strategy. Furthermore, Go-Exploit's sample efficiency improves when KataGo's other innovations are incorporated.
### What problem does this paper attempt to solve?
This paper addresses the low sample efficiency of AlphaZero training, which it attributes to three limitations. First, AlphaZero's self-play games always begin from the initial state of the game and sample actions only over the first few moves, which limits its exploration of states deeper in the game tree. Second, because AlphaZero depends on exploratory actions to reach new states, it must train under a weaker, more exploratory policy, which slows policy iteration. Third, each complete self-play game yields only a single independent, noisy value target, which slows training of the value function.
To address these limitations, the authors propose a new search control strategy, Go-Exploit. Go-Exploit improves on AlphaZero in the following ways:
1. **Start self-play from diverse starting states**: Go-Exploit samples the start state of each self-play trajectory from an archive of states of interest, so it explores the game tree more effectively.
2. **Improve the generalization of the value function**: Because self-play begins from varied starting states, Go-Exploit learns a value function that generalizes better.
3. **Increase the number of independent value targets**: Starting from archived states produces shorter self-play trajectories, so Go-Exploit trains on more independent value targets, which improves value training.
4. **Reduce the need for exploratory actions**: The exploration inherent in Go-Exploit reduces its need for exploratory actions, allowing it to train under more exploitative policies.
Experimental results on Connect Four and 9x9 Go show that Go-Exploit learns with greater sample efficiency than standard AlphaZero, resulting in stronger performance against reference opponents and in head-to-head play.
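The idea of drawing self-play start states from an archive can be illustrated with a minimal sketch. The class name `StartStateArchive`, the bounded first-in-first-out eviction, the uniform sampling, and the `p_initial` parameter (the chance of starting from the game's initial state instead) are all assumptions made for illustration; the summary above does not specify how the archive is filled or sampled.

```python
import random
from collections import deque


class StartStateArchive:
    """Bounded archive of states of interest (hypothetical helper for illustration)."""

    def __init__(self, capacity, initial_state, p_initial=0.2, seed=None):
        # deque(maxlen=...) silently evicts the oldest state when full (assumed policy).
        self.states = deque(maxlen=capacity)
        self.initial_state = initial_state
        # Assumed probability of starting a trajectory from the game's initial state.
        self.p_initial = p_initial
        self.rng = random.Random(seed)

    def add(self, state):
        """Record a state of interest, e.g. one visited during self-play."""
        self.states.append(state)

    def sample_start(self):
        """Return the start state for the next self-play trajectory."""
        if not self.states or self.rng.random() < self.p_initial:
            return self.initial_state
        # Uniform sampling over archived states (an assumption of this sketch).
        return self.rng.choice(self.states)
```

A training loop would call `sample_start()` before each self-play trajectory and `add()` for states it wants to revisit. Trajectories that begin mid-game are shorter, which is how more independent value targets arise for the same number of sampled positions.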
### Key formulas
- **PUCT action selection rule**:
\[
a^{*}=\arg\max_a\left(Q(s, a)+c_{\text{puct}}P(s, a)\frac{\sqrt{N(s)}}{1 + N(s, a)}\right)
\]
where \(Q(s, a)\) is the action-value estimate, \(P(s, a)\) is the prior probability, \(N(s)\) and \(N(s, a)\) are the visit counts of the state and the state-action pair respectively, and \(c_{\text{puct}}\) is the exploration constant.
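As a rough illustration of PUCT selection, the sketch below scores each action using the standard AlphaZero form, in which the square root covers only the parent visit count \(N(s)\) and the exploration bonus is divided by \(1 + N(s, a)\). The function name `puct_select`, the list-based inputs, the default `c_puct=1.5`, and the choice to compute \(N(s)\) as the sum of child visit counts are assumptions of this sketch.

```python
import math


def puct_select(q, prior, visits, c_puct=1.5):
    """Return the index of the action maximising the PUCT score.

    q, prior and visits are parallel lists indexed by action.
    N(s) is taken to be the sum of child visit counts (an assumption).
    """
    parent_visits = sum(visits)

    def score(a):
        bonus = c_puct * prior[a] * math.sqrt(parent_visits) / (1 + visits[a])
        return q[a] + bonus

    # max() returns the first maximiser, so ties go to the lowest action index.
    return max(range(len(q)), key=score)
```

With equal values and priors, the bonus term favours the less-visited action; for example `puct_select([0.0, 0.0], [0.5, 0.5], [10, 0])` selects action 1.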
- **Visit-count policy**:
\[
\pi_t(a|s_t)=\frac{N(s_t, a)^{1/\tau}}{\sum_b N(s_t, b)^{1/\tau}}
\]
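where \(\tau\) is the softmax temperature, which controls the balance between exploration and exploitation: larger \(\tau\) flattens the distribution over actions, while \(\tau \to 0\) concentrates it on the most-visited action.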
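The effect of the temperature can be seen in a small sketch that converts visit counts into a policy. The function name `visit_count_policy` is hypothetical, and dividing by the largest count before exponentiating is a numerical-stability choice of this sketch that leaves the normalised result unchanged.

```python
def visit_count_policy(visits, tau=1.0):
    """Turn visit counts N(s, a) into probabilities proportional to N(s, a)^(1/tau)."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    largest = max(visits)
    if largest <= 0:
        raise ValueError("at least one action must have been visited")
    # Scaling by the largest count avoids overflow for small tau;
    # the common factor cancels when normalising.
    weights = [(n / largest) ** (1.0 / tau) for n in visits]
    total = sum(weights)
    return [w / total for w in weights]
```

For visit counts `[1, 3]`, `tau=1.0` gives probabilities proportional to the counts (0.25 and 0.75), while `tau=0.5` squares the counts and shifts the mass toward the most-visited action (0.1 and 0.9).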
- **Loss function**:
\[
\text{loss}=(z - v)^2-\pi_t^T\log(p)+c\|\theta\|^2
\]
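where \(z\) is the game result, \(v\) is the network's value estimate, \(\pi_t\) is the visit-count policy used as the policy target, \(p\) is the network's policy output (the prior used in action selection), \(\theta\) denotes the neural network parameters, and \(c\) is the regularization constant.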
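The three terms of the loss can be computed for a single training example as in the following sketch. The function name `alphazero_loss`, the flat list representation of \(\theta\), the default `c=1e-4`, and the small `eps` added inside the logarithm to guard against \(\log 0\) are assumptions for illustration; a real implementation would evaluate this over minibatches in a deep learning framework.

```python
import math


def alphazero_loss(z, v, pi, p, theta, c=1e-4, eps=1e-12):
    """Per-example loss: squared value error + policy cross-entropy + L2 penalty.

    z: game result, v: predicted value,
    pi: target policy (list), p: predicted policy (list),
    theta: flat list of network parameters (simplification for this sketch).
    """
    value_loss = (z - v) ** 2
    # Cross-entropy -pi^T log(p); eps (an assumption) avoids log(0).
    policy_loss = -sum(t * math.log(q + eps) for t, q in zip(pi, p))
    l2_penalty = c * sum(w * w for w in theta)
    return value_loss + policy_loss + l2_penalty
```

The value term is what benefits from more independent targets \(z\): many positions drawn from one game share the same outcome, whereas positions drawn from many shorter games do not.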
Together, these changes let Go-Exploit reach stronger play from fewer training samples than standard AlphaZero.