Restless Linear Bandits

Azadeh Khaleghi
2024-05-17
Abstract: A more general formulation of the linear bandit problem is considered to allow for dependencies over time. Specifically, it is assumed that there exists an unknown $\mathbb{R}^d$-valued stationary $\varphi$-mixing sequence of parameters $(\theta_t,~t \in \mathbb{N})$ which gives rise to pay-offs. This instance of the problem can be viewed as a generalization of both the classical linear bandits with iid noise, and the finite-armed restless bandits. In light of the well-known computational hardness of optimal policies for restless bandits, an approximation is proposed whose error is shown to be controlled by the $\varphi$-dependence between consecutive $\theta_t$. An optimistic algorithm, called LinMix-UCB, is proposed for the case where $\theta_t$ has an exponential mixing rate. The proposed algorithm is shown to incur a sub-linear regret of $\mathcal{O}\left(\sqrt{d n\mathrm{polylog}(n) }\right)$ with respect to an oracle that always plays a multiple of $\mathbb{E}\theta_t$. The main challenge in this setting is to ensure that the exploration-exploitation strategy is robust against long-range dependencies. The proposed method relies on Berbee's coupling lemma to carefully select near-independent samples and construct confidence ellipsoids around empirical estimates of $\mathbb{E}\theta_t$.
Machine Learning, Information Theory
What problem does this paper attempt to address?
The paper studies sequential decision-making in linear bandits when the pay-offs are dependent over time. The dependence is modeled by an unknown stationary \(\mathbb{R}^d\)-valued \(\varphi\)-mixing parameter sequence \((\theta_t,~t \in \mathbb{N})\), so that the reward \(Y_t\) for an action \(X_t\) is the inner product \(\langle \theta_t, X_t \rangle\). The main contributions of the paper are as follows:

1. **Problem Modeling**: The authors formulate a more general linear bandit problem in which the reward is \(Y_t = \langle \theta_t, X_t \rangle\) and \(\theta_t\) is an unknown stationary \(\varphi\)-mixing sequence, so rewards are correlated over time. This setting generalizes both the classical linear bandit with independent and identically distributed noise and the finite-armed restless bandit.
2. **Strategy Approximation**: Because optimal policies for restless bandits are known to be computationally hard, the paper proposes an approximation to the optimal strategy. The approximation error is controlled by the mixing coefficient \(\varphi_1\), which measures the dependence between consecutive \(\theta_t\), and the analysis also involves the norm of \(\theta_t\).
3. **Algorithm Design**: The paper proposes an algorithm called LinMix-UCB for the case where the parameter sequence has an exponential mixing rate. The algorithm follows the principle of Optimism in the Face of Uncertainty and uses Berbee's coupling lemma to select approximately independent samples, from which it constructs a confidence ellipsoid around the empirical estimate of \(\theta^*\) (i.e., the expected value \(\mathbb{E}\theta_t\)).
4. **Theoretical Results**: LinMix-UCB is shown to incur sub-linear regret of \(\mathcal{O}\left(\sqrt{d n\,\mathrm{polylog}(n)}\right)\) over \(n\) rounds, measured against an oracle that always plays a multiple of \(\mathbb{E}\theta_t\). The paper also discusses the algorithm's performance in both finite-horizon and infinite-horizon scenarios.
5. **Outlook**: The paper concludes with a discussion of future research directions, including how to learn without knowing the mixing parameters and how to estimate them. It also notes that regret lower bounds for this class of problems remain an open question.
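
To make the setting concrete, the following Python sketch simulates rewards \(Y_t = \langle \theta_t, X_t \rangle\) and runs a generic optimistic (UCB-style) rule whose ridge estimate is updated only on rounds spaced `gap` steps apart. It is an illustration under stated assumptions, not the paper's LinMix-UCB: the two-state Markov chain for \(\theta_t\) (a finite-state ergodic chain is one simple example of an exponentially \(\varphi\)-mixing sequence), the three-action set, the fixed spacing `gap`, the constant confidence radius `beta`, and all function names are hypothetical choices made here. The paper's sample selection via Berbee's coupling lemma and its confidence radius are not reproduced.

```python
import numpy as np


def simulate_theta(n, states, p_stay, rng):
    """Sample theta_1..theta_n from a symmetric two-state Markov chain."""
    idx = np.empty(n, dtype=int)
    idx[0] = rng.integers(2)  # uniform start, which is the stationary law here
    for t in range(1, n):
        stay = rng.random() < p_stay
        idx[t] = idx[t - 1] if stay else 1 - idx[t - 1]
    return states[idx]


def ucb_index(actions, V, b, beta):
    """Optimistic value of every action under the current ridge estimate."""
    theta_hat = np.linalg.solve(V, b)
    V_inv_a = np.linalg.solve(V, actions.T)  # V^{-1} x for each action x
    widths = np.sqrt(np.sum(actions.T * V_inv_a, axis=0))  # sqrt(x^T V^{-1} x)
    return actions @ theta_hat + beta * widths


def run_sketch(n=2000, gap=5, beta=1.0, lam=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # Hypothetical parameter values; their average plays the role of E[theta_t].
    states = np.array([[1.0, 0.0], [0.6, 0.4]])
    actions = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
    thetas = simulate_theta(n, states, p_stay=0.9, rng=rng)

    d = actions.shape[1]
    V, b = lam * np.eye(d), np.zeros(d)
    counts = np.zeros(len(actions), dtype=int)

    for t in range(n):
        k = int(np.argmax(ucb_index(actions, V, b, beta)))
        x = actions[k]
        y = thetas[t] @ x  # reward Y_t = <theta_t, X_t>
        counts[k] += 1
        # Assumption: only every gap-th round feeds the estimator, so the
        # samples it uses are spaced apart and less dependent on each other.
        if t % gap == 0:
            V += np.outer(x, x)
            b += y * x

    theta_hat = np.linalg.solve(V, b)
    theta_star = states.mean(axis=0)  # stationary mean of the symmetric chain
    return counts, theta_hat, theta_star
```

Calling `run_sketch()` returns the number of times each action was played, the final ridge estimate, and the stationary mean of the simulated parameter sequence. Increasing `gap` reduces the dependence between the samples used for estimation at the cost of discarding more observations, which is the trade-off the spaced-sample idea has to balance.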