A Survey on Contextual Multi-armed Bandits

Li Zhou
DOI: https://doi.org/10.48550/arXiv.1508.03326
2016-02-01
Abstract: In this survey we cover a few stochastic and adversarial contextual bandit algorithms. We analyze each algorithm's assumption and regret bound.
Machine Learning
What problem does this paper attempt to address?
This paper addresses how to optimize decision-making in the multi-armed bandit (MAB) problem when context information is available. Specifically, at each time step the agent selects an action (such as an advertisement or a news article) based on the observed context, with the goal of maximizing the cumulative reward (such as click-through rate or dwell time). The paper focuses on how to use context information to improve the decision-making strategy and minimize regret, that is, the reward lost relative to the optimal strategy.

### Specific Problem Description

1. **Contextual Multi-armed Bandit Problem**:
   - In the traditional multi-armed bandit problem, an agent makes decisions over a series of time steps, choosing one "arm" (action) each time and receiving a reward that depends on that choice. In the contextual multi-armed bandit problem, the agent also observes additional information at each time step, called the context. Contexts can be user characteristics, environmental states, and so on.
2. **Objective**:
   - The agent's objective is to minimize regret by using context information. Regret is defined as the gap between the cumulative reward of the best strategy among all strategies under consideration and the cumulative reward the agent actually obtains.
3. **Challenges**:
   - Context information makes the problem more complex because the optimal action may change as the context changes. In addition, the agent observes only the reward of the action it selected and never the rewards of the unselected actions, which makes learning harder.

### Main Contributions of the Paper

This paper provides a comprehensive review of the contextual multi-armed bandit problem and introduces a variety of algorithms and their theoretical guarantees.
Specifically:

- **Unbiased Reward Estimator**: Addresses the partial-feedback problem, that is, observing only the reward of the selected action.
- **Reduction to K-armed Bandit**: Reduces the contextual multi-armed bandit problem to multiple independent K-armed bandit problems.
- **Stochastic Contextual Multi-armed Bandit**: Assumes that the reward follows a probability distribution and reviews several algorithms based on the linear realizability assumption (such as LinUCB and SupLinUCB).
- **Adversarial Contextual Multi-armed Bandit**: Considers the setting where the rewards may be chosen by an adversary and reviews the corresponding algorithms (such as EXP4 and EXP4.P).
- **Kernelized Stochastic Contextual Multi-armed Bandit**: When the reward function is nonlinear, uses kernel methods for modeling (such as GP-UCB and KernelUCB).

In short, this paper aims to provide a systematic framework and toolset to help researchers and practitioners better understand and solve the contextual multi-armed bandit problem.
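
To make the interaction loop and the regret objective concrete, here is a minimal sketch of a LinUCB-style learner, one of the algorithms named above. It is an illustration under stated assumptions and not the survey's own formulation. The environment is synthetic, with a hidden weight vector per arm and Gaussian noise. The learner is the per-arm ("disjoint") variant, which sees one shared context vector per round. The names `run_linucb`, `alpha`, and `noise` are hypothetical choices made for this example.

```python
import numpy as np

def run_linucb(n_rounds=2000, n_arms=4, dim=5, alpha=1.0, noise=0.1, seed=0):
    """Run a per-arm LinUCB sketch on synthetic data; return cumulative regret."""
    rng = np.random.default_rng(seed)

    # Assumed environment: arm a has a hidden unit vector theta[a], and its
    # expected reward in context x is the inner product theta[a] . x.
    theta = rng.normal(size=(n_arms, dim))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)

    # Per-arm ridge-regression statistics: A[a] = I + sum x x^T, b[a] = sum r x.
    A = np.stack([np.eye(dim) for _ in range(n_arms)])
    b = np.zeros((n_arms, dim))

    cumulative_regret = np.zeros(n_rounds)
    total = 0.0
    for t in range(n_rounds):
        # Observe a unit-norm context for this round.
        x = rng.normal(size=dim)
        x /= np.linalg.norm(x)

        # Score each arm by its estimated reward plus an uncertainty bonus.
        ucb = np.empty(n_arms)
        for a in range(n_arms):
            A_inv = np.linalg.inv(A[a])
            theta_hat = A_inv @ b[a]
            ucb[a] = theta_hat @ x + alpha * np.sqrt(x @ A_inv @ x)
        arm = int(np.argmax(ucb))

        # Partial feedback: only the chosen arm's noisy reward is observed.
        expected = theta @ x
        reward = expected[arm] + noise * rng.normal()
        A[arm] += np.outer(x, x)
        b[arm] += reward * x

        # Regret compares against the best arm for this particular context.
        total += expected.max() - expected[arm]
        cumulative_regret[t] = total
    return cumulative_regret

regret = run_linucb()
print(f"cumulative regret after {len(regret)} rounds: {regret[-1]:.2f}")
print(f"average per-round regret: {regret[-1] / len(regret):.4f}")
```

The per-round regret is computed from the hidden `theta`, which only a simulator can see. On real data the rewards of unselected arms are unobserved, which is why the unbiased reward estimators listed above matter for evaluation.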