Abstract: Contextual bandits constitute a classical framework for decision-making under uncertainty. In this setting, the goal is to learn the arms of highest reward subject to contextual information, while the unknown reward parameters of each arm need to be learned by experimenting with that specific arm. Accordingly, a fundamental problem is that of balancing exploration (i.e., pulling different arms to learn their parameters) versus exploitation (i.e., pulling the best arms to gain reward). To study this problem, the existing literature mostly considers perfectly observed contexts. However, the setting of partial context observations remains unexplored to date, despite being theoretically more general and practically more versatile. We study bandit policies for learning to select optimal arms based on observations that are noisy linear functions of the unobserved context vectors. Our theoretical analysis shows that the Thompson sampling policy successfully balances exploration and exploitation. Specifically, we establish the following: (i) regret bounds that grow poly-logarithmically with time, (ii) square-root consistency of parameter estimation, and (iii) scaling of the regret with other quantities, including dimensions and the number of arms. Extensive numerical experiments with both real and synthetic data are presented as well, corroborating the efficacy of Thompson sampling. To establish the results, we introduce novel martingale techniques and concentration inequalities to address partially observed dependent random variables generated from unspecified distributions, and also leverage problem-dependent information to sharpen probabilistic bounds for time-varying suboptimality gaps. These techniques pave the way for studying other decision-making problems with contextual information as well as partial observations.
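To make the setting concrete, the following is a minimal simulation sketch of Thompson sampling when only noisy linear observations of the contexts are available. It is an illustration under stated assumptions, not the paper's exact algorithm: the Gaussian contexts and noise, the shared sensing matrix `A`, the identity prior precision, and all names (`thompson_partial_contexts`, `mu`, `B`, `b`) are hypothetical choices made here. Each arm keeps a Gaussian posterior over a parameter that acts directly on the observation, samples from it, and the arm with the highest sampled score is pulled.

```python
import numpy as np

def thompson_partial_contexts(T=2000, n_arms=3, dx=4, dy=3, noise=0.5, seed=0):
    """Simulate Thompson sampling on noisy linear observations of contexts.

    All modeling choices (Gaussian contexts, shared sensing matrix,
    identity prior) are assumptions made for illustration only.
    """
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(dy, dx))        # assumed sensing matrix: y = A x + noise
    mu = rng.normal(size=(n_arms, dx))   # true (unknown) reward parameters per arm
    B = np.stack([np.eye(dy)] * n_arms)  # per-arm posterior precision matrices
    b = np.zeros((n_arms, dy))           # per-arm sums of reward-weighted observations
    pulls = np.zeros(n_arms, dtype=int)
    total_reward = 0.0

    for _ in range(T):
        x = rng.normal(size=(n_arms, dx))                    # contexts (never observed)
        y = x @ A.T + noise * rng.normal(size=(n_arms, dy))  # noisy linear observations

        scores = np.empty(n_arms)
        for i in range(n_arms):
            cov = np.linalg.inv(B[i])
            cov = (cov + cov.T) / 2                  # symmetrize against round-off
            mean = cov @ b[i]
            eta = rng.multivariate_normal(mean, cov)  # posterior sample
            scores[i] = y[i] @ eta                    # sampled reward estimate

        a = int(np.argmax(scores))                   # pull the arm with the best sample
        r = x[a] @ mu[a] + noise * rng.normal()      # reward depends on the hidden context

        B[a] += np.outer(y[a], y[a])                 # update only the pulled arm
        b[a] += r * y[a]
        pulls[a] += 1
        total_reward += r

    return pulls, total_reward

if __name__ == "__main__":
    pulls, total_reward = thompson_partial_contexts()
    print("pulls per arm:", pulls)
    print("cumulative reward:", total_reward)
```

Only the pulled arm's statistics are updated in each round, which is what forces the policy to trade off exploration against exploitation: an arm with few pulls has a wide posterior, so its sampled score occasionally wins even when its posterior mean is low.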

Adaptive Portfolio by Solving Multi-armed Bandit Via Thompson Sampling

Improving Portfolio Optimization Results with Bandit Networks

Thompson Sampling Algorithms for Mean-Variance Bandits

Risk-Aware Multi-Armed Bandit Problem with Application to Portfolio Selection

Portfolio Choices with Orthogonal Bandit Learning

Thompson Sampling for Budgeted Multi-Armed Bandits

Efficient and Adaptive Posterior Sampling Algorithms for Bandits

Kolmogorov-Smirnov Test-Based Actively-Adaptive Thompson Sampling for Non-Stationary Bandits

Optimal Regret Analysis of Thompson Sampling in Stochastic Multi-armed Bandit Problem with Multiple Plays

Risk-averse Contextual Multi-armed Bandit Problem with Linear Payoffs

A Unifying Theory of Thompson Sampling for Continuous Risk-Averse Bandits

Contextual combinatorial bandit on portfolio management

Multi-Armed Bandit Strategies for Non-Stationary Reward Distributions and Delayed Feedback Processes

Thompson Sampling in Partially Observable Contextual Bandits

Parallelizing Thompson Sampling

Multiarmed Bandits Problem Under the Mean-Variance Setting

A non-parametric solution to the multi-armed bandit problem with covariates

Challenges in Statistical Analysis of Data Collected by a Bandit Algorithm: An Empirical Exploration in Applications to Adaptively Randomized Experiments

Thompson sampling with the online bootstrap

A Risk-Averse Framework for Non-Stationary Stochastic Multi-Armed Bandits

Thompson Sampling in Switching Environments with Bayesian Online Change Point Detection