Tabular and Deep Learning for the Whittle Index

Francisco Robledo Relaño, Vivek Borkar, Urtzi Ayesta, Konstantin Avrachenkov

DOI: https://doi.org/10.1145/3670686

2024-06-04

Abstract: The Whittle index policy is a heuristic that has shown remarkably good performance (with guaranteed asymptotic optimality) when applied to the class of problems known as Restless Multi-Armed Bandit Problems (RMABPs). In this paper we present QWI and QWINN, two reinforcement learning algorithms, respectively tabular and deep, to learn the Whittle index for the total discounted criterion. The key feature is the use of two time-scales, a faster one to update the state-action Q-values, and a relatively slower one to update the Whittle indices. In our main theoretical result we show that QWI, which is a tabular implementation, converges to the real Whittle indices. We then present QWINN, an adaptation of QWI algorithm using neural networks to compute the Q-values on the faster time-scale, which is able to extrapolate information from one state to another and scales naturally to large state-space environments. For QWINN, we show that all local minima of the Bellman error are locally stable equilibria, which is the first result of its kind for DQN-based schemes. Numerical computations show that QWI and QWINN converge faster than the standard Q-learning algorithm, neural-network based approximate Q-learning and other state-of-the-art algorithms.

Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

This paper addresses the problem of learning the Whittle index in Restless Multi-Armed Bandit Problems (RMABPs). The Whittle index policy is a heuristic commonly used for such problems, especially when the RMABP is large. The paper proposes two reinforcement learning algorithms, QWI (tabular) and QWINN (deep-learning based), that learn the Whittle index under the total discounted reward criterion. QWI uses two time-scales: a faster one to update the state-action values (Q-values) and a slower one to update the Whittle indices. The main theoretical result shows that QWI converges to the true Whittle indices. QWINN is a neural-network adaptation of QWI that can extrapolate information from one state to another, which makes it suitable for environments with large state spaces. For QWINN, the paper shows that all local minima of the Bellman error are locally stable equilibria, the first result of its kind for DQN-based schemes. The paper also compares QWI and QWINN with standard Q-learning, neural-network approximate Q-learning, and other state-of-the-art algorithms. Both perform better in terms of convergence speed and discounted reward, and QWINN in particular obtains accurate Whittle indices even from limited data samples. The study does not analyze the regret of the algorithms; it focuses instead on how the learned policies perform over time and on the convergence of the Whittle indices.
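The two-time-scale idea can be illustrated with a minimal tabular sketch for a single arm. This is not the authors' implementation: the function name `qwi_sketch`, the uniform random exploration, the step-size exponents, and the convention that the passive action (a = 0) receives the subsidy are all assumptions made for illustration. For every reference state x the sketch keeps a separate Q-table and a subsidy estimate lam[x], and it nudges lam[x] until the active and passive Q-values at x agree.

```python
import numpy as np


def qwi_sketch(P0, P1, r0, r1, gamma=0.9, n_steps=20000, seed=0):
    """Illustrative two-time-scale tabular learner for one restless arm.

    P0, P1: transition matrices under passive (0) and active (1) actions.
    r0, r1: per-state rewards under passive and active actions.
    Returns one subsidy estimate per reference state.
    """
    rng = np.random.default_rng(seed)
    P = np.stack([np.asarray(P0, float), np.asarray(P1, float)])  # P[a, s, s']
    R = np.stack([np.asarray(r0, float), np.asarray(r1, float)])  # R[a, s]
    n = R.shape[1]
    Q = np.zeros((n, n, 2))  # Q[x, s, a]: one Q-table per reference state x
    lam = np.zeros(n)        # lam[x]: current index estimate for state x
    xs = np.arange(n)
    s = 0
    for k in range(1, n_steps + 1):
        # Assumed step sizes: beta / alpha -> 0, so lam moves on the slower scale.
        alpha = (k + 1) ** -0.6
        beta = (k + 1) ** -0.9
        a = rng.integers(2)                # assumption: uniform exploration
        s_next = rng.choice(n, p=P[a, s])  # simulate one transition of the arm
        # Fast time-scale: Q-learning step for every reference state at once.
        # The passive action (a == 0) earns the subsidy lam[x] on top of r0.
        target = R[a, s] + (1 - a) * lam + gamma * Q[:, s_next, :].max(axis=1)
        Q[:, s, a] += alpha * (target - Q[:, s, a])
        # Slow time-scale: move lam[x] toward the point where the active and
        # passive actions are equally attractive in state x.
        lam += beta * (Q[xs, xs, 1] - Q[xs, xs, 0])
        s = s_next
    return lam


if __name__ == "__main__":
    # Hypothetical two-state arm, used only to exercise the sketch.
    P0 = [[0.7, 0.3], [0.4, 0.6]]
    P1 = [[0.2, 0.8], [0.5, 0.5]]
    print(qwi_sketch(P0, P1, r0=[0.0, 0.0], r1=[1.0, 2.0]))
```

Keeping one Q-table per reference state makes the sketch memory-hungry for large state spaces, which is the limitation that a function approximator such as the neural network in QWINN is meant to address.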

Finite-Time Analysis of Whittle Index based Q-Learning for Restless Multi-Armed Bandits with Neural Network Function Approximation

Whittle Index Learning Algorithms for Restless Bandits with Constant Stepsizes

Whittle Index with Multiple Actions and State Constraint for Inventory Management

Tabular and Deep Reinforcement Learning for Gittins Index

GINO-Q: Learning an Asymptotically Optimal Index Policy for Restless Multi-armed Bandits

Faster Q-Learning Algorithms for Restless Bandits

A unifying computations of Whittle's Index for Markovian bandits

Testing Indexability and Computing Whittle and Gittins Index in Subcubic Time

ContextWIN: Whittle Index Based Mixture-of-Experts Neural Model For Restless Bandits Via Deep RL

Indexability of Finite State Restless Multi-Armed Bandit and Rollout Policy

Whittle Index Based Q-Learning for Wireless Edge Caching with Linear Function Approximation

Scalable Decision-Focused Learning in Restless Multi-Armed Bandits with Application to Maternal and Child Health

MathDQN: Solving Arithmetic Word Problems Via Deep Reinforcement Learning

PCL-Indexability and Whittle Index for Restless Bandits with General Observation Models

Whittle Index Policy for Dynamic Multichannel Allocation in Remote State Estimation

Learning Augmented Index Policy for Optimal Service Placement at the Network Edge

Optimizing AoI at Query in Multiuser Wireless Uplink Networks: A Whittle Index Approach

Collapsing Bandits and Their Application to Public Health Interventions

An Information-Theoretic Optimality Principle for Deep Reinforcement Learning

Reinforcement Learning Augmented Asymptotically Optimal Index Policy for Finite-Horizon Restless Bandits