Abstract: Thompson sampling (TS) is one of the earliest and most popular algorithms for solving stochastic multi-armed bandit problems. We consider a variant of TS, named $\alpha$-TS, where we use a fractional or $\alpha$-posterior ($\alpha\in(0,1)$) instead of the standard posterior distribution. To compute an $\alpha$-posterior, the likelihood in the definition of the standard posterior is tempered with a factor $\alpha$. For $\alpha$-TS we obtain both instance-dependent $\mathcal{O}\left(\sum_{k \neq i^*} \Delta_k\left(\frac{\log(T)}{C(\alpha)\Delta_k^2} + \frac{1}{2} \right)\right)$ and instance-independent $\mathcal{O}(\sqrt{KT\log K})$ frequentist regret bounds under very mild conditions on the prior and reward distributions, where $\Delta_k$ is the gap between the true mean rewards of the $k^{th}$ arm and the best arm, and $C(\alpha)$ is a known constant. Both the sub-Gaussian and exponential family models satisfy our general conditions on the reward distribution. Our conditions on the prior distribution only require its density to be positive, continuous, and bounded. We also establish another instance-dependent regret upper bound that matches (up to constants) that of improved UCB [Auer and Ortner, 2010]. Our regret analysis carefully combines recent theoretical developments in non-asymptotic concentration analysis and Bernstein-von Mises type results for the $\alpha$-posterior distribution. Moreover, our analysis does not require additional structural properties such as closed-form posteriors or conjugate priors.
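
For reference, the tempering described in the abstract is commonly written as below. The notation is an assumption made here, not taken from the source: $\pi$ denotes the prior density, $p(\cdot \mid \theta)$ the reward likelihood of a single arm, and $X_1, \dots, X_n$ the rewards observed from that arm.

$$
\pi_\alpha(\theta \mid X_{1:n}) = \frac{\pi(\theta)\prod_{i=1}^{n} p(X_i \mid \theta)^{\alpha}}{\int \pi(\vartheta)\prod_{i=1}^{n} p(X_i \mid \vartheta)^{\alpha}\, d\vartheta}, \qquad \alpha \in (0,1).
$$

Setting $\alpha = 1$ in this expression gives the standard posterior. As an illustrative special case (again an assumption, chosen only because it has a closed form), Bernoulli rewards with a $\mathrm{Beta}(a, b)$ prior and $S$ successes out of $n$ pulls yield the $\alpha$-posterior $\mathrm{Beta}(a + \alpha S,\; b + \alpha (n - S))$.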

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the theoretical analysis of posterior distributions and regret bounds in the multi-armed bandit (MAB) problem. Specifically, it introduces an algorithm variant named α-Thompson Sampling (α-TS), in which a fractional posterior (the α-posterior) replaces the standard posterior distribution. The main objectives of the paper are:

1. **Derive the regret bounds of α-TS**:
   - **Instance-dependent regret bounds**: Under very mild conditions on the prior and reward distributions, derive the instance-dependent regret bound $\mathcal{O}\left(\sum_{k \neq i^*} \Delta_k\left(\frac{\log(T)}{C(\alpha)\Delta_k^2} + \frac{1}{2}\right)\right)$.
   - **Instance-independent regret bounds**: Under the same conditions, derive the instance-independent regret bound $\mathcal{O}(\sqrt{KT \log K})$.
2. **Expand the scope of application of Thompson Sampling**:
   - Regret analyses of standard Thompson Sampling usually assume a specific form of posterior distribution (for example, one arising from a conjugate prior). By using the α-posterior, this paper relaxes the requirements on the prior and reward distributions, so the analysis applies to a wider range of settings.
3. **Combine modern statistical theories**:
   - The paper uses non-asymptotic concentration analysis and Bernstein-von Mises-type results, which are recently developed tools in Bayesian statistics. With these tools, the paper analyzes the properties of the α-posterior more precisely and derives the regret bounds.

### Main contributions

- **Regret bounds under general conditions**: The paper proposes general conditions on the reward and prior distributions that cover sub-Gaussian and exponential family reward distributions. The prior density only needs to be positive, continuous, and bounded. These conditions are more relaxed than those in previous studies, making α-TS applicable to a wider range of scenarios.
- **No need for specific structural assumptions**: Unlike many previous studies, the analysis does not assume a closed-form posterior distribution or a conjugate prior, which greatly expands the scope of application of Thompson Sampling.
- **Verification of theoretical results**: The paper evaluates α-TS through numerical experiments, in particular for different values of α, showing that α-TS performs similarly to standard Thompson Sampling when α is close to 1.

### Related work

- **Early work**: Lai et al. first proposed theoretical results for the MAB problem in 1985, establishing a general asymptotic lower bound. Burnetas and Katehakis further generalized this result in 1996.
- **Regret bounds of UCB and TS**: Agrawal and Goyal established problem-independent regret bounds for Thompson Sampling in 2017, but assumed a specific form of posterior distribution. Mazumdar et al. provided regret bounds under more general conditions in 2020, but required strict structural assumptions.
- **Modern developments**: Jin et al. proposed a minimax-optimal Gaussian TS algorithm in 2021, but it requires knowing the time horizon in advance. Fan and Glynn studied the limiting distribution of the stochastic regret in 2021, rather than only the expected regret in finite time.

In conclusion, this paper significantly expands the theoretical basis and practical scope of Thompson Sampling by introducing α-Thompson Sampling and deriving its regret bounds under very general conditions.
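
To make the α-TS idea summarized above concrete, here is a minimal sketch, assuming Bernoulli rewards with independent Beta(1, 1) priors. That model is chosen here purely for illustration; the paper's analysis does not require conjugacy. Under this assumption, tempering the likelihood by α gives a Beta(1 + αS, 1 + αF) α-posterior, where S and F are an arm's success and failure counts. The function name `alpha_ts_bernoulli` and its parameters are hypothetical and do not come from the paper.

```python
import random


def alpha_ts_bernoulli(true_means, horizon, alpha=0.9, seed=0):
    """Simulate alpha-TS on a Bernoulli bandit (illustrative sketch only).

    Assumptions (not from the source): Bernoulli rewards and Beta(1, 1)
    priors, so the alpha-posterior of each arm is
    Beta(1 + alpha * successes, 1 + alpha * failures).
    Returns the pull count per arm and the cumulative pseudo-regret.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]; alpha = 1 is standard TS")

    rng = random.Random(seed)
    num_arms = len(true_means)
    successes = [0] * num_arms
    failures = [0] * num_arms
    pulls = [0] * num_arms
    best_mean = max(true_means)
    regret = 0.0

    for _ in range(horizon):
        # Draw one plausible mean per arm from its alpha-posterior.
        samples = [
            rng.betavariate(1 + alpha * successes[a], 1 + alpha * failures[a])
            for a in range(num_arms)
        ]
        # Play the arm whose sampled mean is largest.
        arm = max(range(num_arms), key=samples.__getitem__)

        # Observe a Bernoulli reward from the chosen arm.
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1

        # Pseudo-regret: gap between the best mean and the played arm's mean.
        regret += best_mean - true_means[arm]

    return pulls, regret


if __name__ == "__main__":
    pulls, regret = alpha_ts_bernoulli([0.2, 0.5, 0.8], horizon=2000, alpha=0.9)
    print("pulls per arm:", pulls)
    print("cumulative pseudo-regret:", round(regret, 2))
```

Smaller values of α down-weight the observed data, so the α-posterior stays wider and the sampler explores for longer. Passing `alpha=1.0` reduces the sketch to standard Beta-Bernoulli Thompson sampling.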

Generalized Regret Analysis of Thompson Sampling using Fractional Posteriors

Efficient and Adaptive Posterior Sampling Algorithms for Bandits

Optimal Regret Analysis of Thompson Sampling in Stochastic Multi-armed Bandit Problem with Multiple Plays

A General Recipe for the Analysis of Randomized Multi-Armed Bandit Algorithms

On the Prior Sensitivity of Thompson Sampling

The Choice of Noninformative Priors for Thompson Sampling in Multiparameter Bandit Models

Thompson Sampling For Combinatorial Bandits: Polynomial Regret and Mismatched Sampling Paradox

Optimality of Thompson Sampling with Noninformative Priors for Pareto Bandits

A Thompson Sampling Algorithm with Logarithmic Regret for Unimodal Gaussian Bandit

Thompson Sampling for Infinite-Horizon Discounted Decision Processes

Analysis and Design of Thompson Sampling for Stochastic Partial Monitoring

Self-accelerated Thompson Sampling with Near-Optimal Regret Upper Bound

An Analysis of Ensemble Sampling

Sliding-Window Thompson Sampling for Non-Stationary Settings

Finite-Time Frequentist Regret Bounds of Multi-Agent Thompson Sampling on Sparse Hypergraphs

Feel-Good Thompson Sampling for Contextual Bandits and Reinforcement Learning

Learning to Optimize via Posterior Sampling

Thompson Sampling for Budgeted Multi-Armed Bandits

Optimistic Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds

The Hardness Analysis of Thompson Sampling for Combinatorial Semi-bandits with Greedy Oracle

A Unifying Theory of Thompson Sampling for Continuous Risk-Averse Bandits