Tabular and Deep Learning for the Whittle Index

Francisco Robledo Relaño, Vivek Borkar, Urtzi Ayesta, Konstantin Avrachenkov
DOI: https://doi.org/10.1145/3670686
2024-06-04
Abstract: The Whittle index policy is a heuristic that has shown remarkably good performance (with guaranteed asymptotic optimality) when applied to the class of problems known as Restless Multi-Armed Bandit Problems (RMABPs). In this paper we present QWI and QWINN, two reinforcement learning algorithms, respectively tabular and deep, to learn the Whittle index for the total discounted criterion. The key feature is the use of two time-scales, a faster one to update the state-action Q-values, and a relatively slower one to update the Whittle indices. In our main theoretical result we show that QWI, which is a tabular implementation, converges to the real Whittle indices. We then present QWINN, an adaptation of the QWI algorithm using neural networks to compute the Q-values on the faster time-scale, which is able to extrapolate information from one state to another and scales naturally to large state-space environments. For QWINN, we show that all local minima of the Bellman error are locally stable equilibria, which is the first result of its kind for DQN-based schemes. Numerical computations show that QWI and QWINN converge faster than the standard Q-learning algorithm, neural-network-based approximate Q-learning and other state-of-the-art algorithms.
Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses the problem of learning the Whittle index in Restless Multi-Armed Bandit Problems (RMABPs). The Whittle index policy is a heuristic commonly used for such problems, especially when the RMABP is large. The paper proposes two reinforcement learning algorithms, QWI (tabular) and QWINN (deep-learning-based), that learn the Whittle index under the total discounted reward criterion. QWI uses two time-scales: the state-action values (Q-values) are updated on the faster one and the Whittle indices on the slower one. The main theoretical result shows that QWI converges to the true Whittle indices. QWINN is a neural-network adaptation of QWI that computes the Q-values with a neural network, so it can extrapolate information from one state to another and is suited to environments with large state spaces. For QWINN, the paper shows that all local minima of the Bellman error are locally stable equilibria, the first result of its kind for DQN-based schemes. The paper also compares QWI and QWINN with standard Q-learning, neural-network-based approximate Q-learning, and other state-of-the-art algorithms, and finds that they converge faster and achieve better discounted reward; QWINN in particular can obtain accurate Whittle indices even from limited data samples. The study does not analyze the regret of the algorithms; it focuses instead on the performance of the learned policies over time and on the convergence of the Whittle indices.
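The two time-scale idea described above can be illustrated with a minimal tabular sketch for a single arm. This is not the authors' implementation. It assumes that the index `lam[x]` acts as a subsidy added to the passive reward, that one Q-table is kept per reference state `x`, and that actions are explored uniformly at random. The function name `qwi`, the array layout of `P` and `R`, and the step-size schedules are all hypothetical choices made for illustration.

```python
import numpy as np

def qwi(P, R, gamma=0.9, n_steps=20000, seed=0):
    """Sketch of a two time-scale tabular Whittle index learner.

    P[a, s, s2]: transition probabilities under action a (0 passive, 1 active).
    R[a, s]:     reward for taking action a in state s.
    Returns one index estimate per state.
    """
    rng = np.random.default_rng(seed)
    n_states = R.shape[1]
    # Q[x, s, a]: Q-value of (s, a) when the passive subsidy equals lam[x].
    Q = np.zeros((n_states, n_states, 2))
    lam = np.zeros(n_states)
    refs = np.arange(n_states)
    s = 0
    for n in range(1, n_steps + 1):
        # Assumed schedules: beta / alpha -> 0, so lam moves on the slower scale.
        alpha = (1.0 + n / 1000.0) ** -0.6
        beta = 0.1 / (1.0 + n / 1000.0)
        a = int(rng.integers(2))                 # uniform exploration (assumption)
        s_next = int(rng.choice(n_states, p=P[a, s]))
        # Fast time-scale: Q-learning step for every reference state x at once.
        target = R[a, s] + (1 - a) * lam + gamma * Q[:, s_next, :].max(axis=1)
        Q[:, s, a] += alpha * (target - Q[:, s, a])
        # Slow time-scale: move lam[x] until both actions tie in state x.
        lam += beta * (Q[refs, refs, 1] - Q[refs, refs, 0])
        s = s_next
    return lam

# Toy usage: a single-state arm with passive reward 0 and active reward 1.
P_toy = np.ones((2, 1, 1))
R_toy = np.array([[0.0], [1.0]])
lam_toy = qwi(P_toy, R_toy, gamma=0.5)
```

For the single-state toy arm, the subsidy that makes both actions equally attractive is the difference between the active and passive rewards, so `lam_toy[0]` should settle near 1. The deep variant described in the summary would replace the table `Q` with a neural network; that is not sketched here.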