Abstract: We study a variant of the contextual bandit problem where an agent can intervene through a set of stochastic expert policies. Given a fixed context, each expert samples actions from a fixed conditional distribution. The agent seeks to remain competitive with the 'best' among the given set of experts. We propose the Divergence-based Upper Confidence Bound (D-UCB) algorithm that uses importance sampling to share information across experts and provide horizon-independent constant regret bounds that only scale linearly in the number of experts. We also provide the Empirical D-UCB (ED-UCB) algorithm that can function with only approximate knowledge of expert distributions. Further, we investigate the episodic setting where the agent interacts with an environment that changes over episodes. Each episode can have different context and reward distributions resulting in the best expert changing across episodes. We show that by bootstrapping from $\mathcal{O}\left(N\log\left(NT^2\sqrt{E}\right)\right)$ samples, ED-UCB guarantees a regret that scales as $\mathcal{O}\left(E(N+1) + \frac{N\sqrt{E}}{T^2}\right)$ for $N$ experts over $E$ episodes, each of length $T$. We finally empirically validate our findings through simulations.

What problem does this paper attempt to address?

This paper studies a variant of the contextual bandit problem in which an agent can intervene through a set of stochastic expert policies. Specifically:

1. **Problem background**:
   - In recommendation systems, the types of users and their preference distributions change over time, so the optimal recommendation changes as well. The system therefore needs to "reset" its learning process periodically.
   - The paper considers systems with known change points (called "episodes" or phases), such as seasonal product recommendations or advertising placement that depends on the time of day.

2. **Problem description**:
   - Given a context, each expert samples actions from a fixed conditional distribution. The agent's goal is to remain competitive with the "best" expert in the given set.
   - In the single-episode case, the agent shares information across experts through importance sampling (IS), which yields horizon-independent constant regret bounds that scale only linearly in the number of experts.
   - In the episodic case, the environment changes between episodes, so the best expert can change as well. For this setting the paper proposes the Empirical D-UCB (ED-UCB) algorithm, which works with only approximate knowledge of the expert distributions.

3. **Main contributions**:
   - The Divergence-based Upper Confidence Bound (D-UCB) algorithm, which uses clipped importance sampling estimators to predict expert rewards, together with analysis showing that these estimates converge exponentially fast to the mean.
   - For the episodic setting, a proof that by bootstrapping from $\mathcal{O}\left(N\log\left(NT^2\sqrt{E}\right)\right)$ samples, ED-UCB achieves a regret bound of $\mathcal{O}\left(E(N+1) + \frac{N\sqrt{E}}{T^2}\right)$, where $N$ is the number of experts, $E$ is the number of episodes, and $T$ is the length of each episode.
   - Simulation experiments that validate the theoretical results and show ED-UCB outperforming traditional multi-armed bandit strategies on multiple datasets.

4. **Application examples**:
   - Consider an online advertising company. In each advertising campaign, it selects the most appropriate pre-trained recommendation model (i.e., expert) according to user characteristics and product features in order to maximize the click-through rate. Because user traffic and product lines change, the best expert may differ from one campaign to the next.

In summary, the paper addresses how to efficiently select the best expert in a changing environment, and proposes the D-UCB and ED-UCB algorithms to do so.
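The idea of sharing information across experts through clipped importance sampling can be illustrated with a minimal sketch. It is not the paper's estimator: it assumes a single fixed context, two hypothetical experts (`expert_a`, `expert_b`) over three actions, Bernoulli rewards with made-up means, and a hand-picked clipping threshold `clip`, whereas D-UCB presumably derives its clipping levels from divergences between expert distributions. The function name `clipped_is_estimate` is likewise an illustrative choice.

```python
import numpy as np


def clipped_is_estimate(rewards, p_target, p_behavior, clip=10.0):
    """Estimate the target expert's mean reward from samples drawn by another expert.

    rewards    : observed rewards, one per sample
    p_target   : probability the target expert assigns to each sampled action
    p_behavior : probability the sampling expert assigned to each sampled action
    clip       : samples whose importance ratio exceeds this value are dropped
                 (hypothetical fixed threshold, chosen by hand here)
    """
    rewards = np.asarray(rewards, dtype=float)
    ratios = np.asarray(p_target, dtype=float) / np.asarray(p_behavior, dtype=float)
    keep = ratios <= clip  # clipping trades a small bias for bounded variance
    return float(np.mean(rewards * ratios * keep))


def demo(n=200_000, seed=0):
    """Sample only from expert A, then estimate expert B's mean reward."""
    rng = np.random.default_rng(seed)
    expert_a = np.array([0.6, 0.3, 0.1])      # assumed action distribution of A
    expert_b = np.array([0.2, 0.3, 0.5])      # assumed action distribution of B
    reward_means = np.array([0.2, 0.5, 0.9])  # assumed Bernoulli mean per action

    actions = rng.choice(3, size=n, p=expert_a)
    rewards = (rng.random(n) < reward_means[actions]).astype(float)

    estimate = clipped_is_estimate(
        rewards, expert_b[actions], expert_a[actions], clip=10.0
    )
    true_mean = float(expert_b @ reward_means)
    return estimate, true_mean
```

Calling `demo()` returns the estimate of expert B's mean reward alongside its true value, even though every sample was drawn by expert A. In this toy setup the largest importance ratio is 5, so a threshold of 10 discards nothing; a lower threshold would drop the high-ratio samples and bias the estimate downward in exchange for lower variance.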

Bandits with Stochastic Experts: Constant Regret, Empirical Experts and Episodes

Stochastic Conservative Contextual Linear Bandits

Distributed Stochastic Bandit Learning with Delayed Context Observation

Stochastic Bandits with Context Distributions

Contextual Bandits for Unbounded Context Distributions

Old Dog Learns New Tricks: Randomized UCB for Bandit Problems

Regret Analysis for Hierarchical Experts Bandit Problem

Context-lumpable stochastic bandits

Batched Neural Bandits

Improved Regret Bounds for Bandits with Expert Advice

Random Effect Bandits

BOF-UCB: A Bayesian-Optimistic Frequentist Algorithm for Non-Stationary Contextual Bandits

Second Order Regret Bounds Against Generalized Expert Sequences under Partial Bandit Feedback

On Abruptly-Changing and Slowly-Varying Multiarmed Bandit Problems

Cooperative Multi-Agent Graph Bandits: UCB Algorithm and Regret Analysis

A Simple Approach For Non-Stationary Linear Bandits

UCB algorithms for multi-armed bandits: Precise regret and adaptive inference

Contextual Bandits with Stage-wise Constraints

Byzantine-Resilient Decentralized Multi-Armed Bandits

Autoregressive Bandits

Bandits with Mean Bounds