Abstract: We study high-dimensional multi-armed contextual bandits with batched feedback, where the $T$ steps of online interaction are divided into $L$ batches. Specifically, each batch collects data according to a policy that depends on the previous batches, and the rewards are revealed only at the end of the batch. Such a feedback structure is common in applications such as personalized medicine and online advertising, where the online data often do not arrive in a fully serial manner. We consider high-dimensional linear settings in which the reward function of the bandit model admits either a sparse or a low-rank structure, and ask how few batches are needed to achieve performance comparable to that of the fully dynamic setting, in which $L = T$. For these settings, we design a provably sample-efficient algorithm that achieves $\tilde{\mathcal{O}}(s_0^2 \log^2 T)$ regret in the sparse case and $\tilde{\mathcal{O}}(r^2 \log^2 T)$ regret in the low-rank case, using only $L = \mathcal{O}(\log T)$ batches. Here $s_0$ and $r$ are the sparsity and rank of the reward parameter in the sparse and low-rank cases, respectively, and $\tilde{\mathcal{O}}(\cdot)$ omits logarithmic factors involving the feature dimensions. In other words, our algorithm achieves regret bounds comparable to those in the fully sequential setting with only $\mathcal{O}(\log T)$ batches. Our algorithm features a novel batch allocation method that adjusts the batch sizes according to the estimation accuracy within each batch and the cumulative regret. We also conduct experiments with synthetic and real-world data to validate our theory.
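To make the $L = \mathcal{O}(\log T)$ batch count concrete, the following is a minimal sketch of a batch grid in which each batch is twice as long as the previous one. It is an illustration only, not the adaptive allocation method described in the abstract: the doubling rule, the function name `doubling_batch_ends`, and the parameter `first` are assumptions introduced here.

```python
def doubling_batch_ends(T, first=1):
    """Return the last step (1-based, inclusive) of each batch.

    Hypothetical helper: batch sizes are first, 2*first, 4*first, ...
    and the final batch is truncated so that the grid ends exactly at T.
    Rewards for a batch would only be revealed at these end points.
    """
    ends = []
    size, t = first, 0
    while t < T:
        t = min(T, t + size)  # close the current batch, never past T
        ends.append(t)
        size *= 2             # the next batch is twice as long
    return ends


if __name__ == "__main__":
    T = 1000
    ends = doubling_batch_ends(T)
    # The policy can be updated only len(ends) times over T steps.
    print(len(ends), ends)
```

With `first=1`, the first $k$ batches cover $2^k - 1$ steps, so the grid uses $\lceil \log_2(T+1) \rceil$ batches, which is logarithmic in $T$.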

Batched Online Contextual Sparse Bandits with Sequential Inclusion of Features

Contextual Bandits with Similarity Information

Sequential Batch Learning in Finite-Action Linear Contextual Bandits

Selectively Contextual Bandits

Partially Observable Contextual Bandits with Linear Payoffs

Deep Contextual Multi-armed Bandits

Contextual Combinatorial Conservative Bandits

Batched Nonparametric Contextual Bandits

Achieving User-Side Fairness in Contextual Bandits

Provably Efficient High-Dimensional Bandit Learning with Batched Feedbacks

Multi-armed Bandits with Cost Subsidy

Contextual Bandit with Adaptive Feature Extraction

A Bayesian Approach for Subset Selection in Contextual Bandits

Adapting multi-armed bandits policies to contextual bandits scenarios

Incentivising Exploration and Recommendations for Contextual Bandits with Payments

Robustly Improving Bandit Algorithms with Confounded and Selection Biased Offline Data: A Causal Approach

Privacy-Preserving Multi-Party Contextual Bandits

A Survey on Practical Applications of Multi-Armed and Contextual Bandits

BOF-UCB: A Bayesian-Optimistic Frequentist Algorithm for Non-Stationary Contextual Bandits

Batched Neural Bandits

Causal Feature Selection Method for Contextual Multi-Armed Bandits in Recommender System