Abstract: Reinforcement Learning (RL) is a widely researched area in artificial intelligence that focuses on teaching agents decision-making through interactions with their environment. A key subset includes stochastic multi-armed bandit (MAB) and stochastic continuum-armed bandit (SCAB) problems, which model sequential decision-making under uncertainty. This review outlines the foundational models and assumptions of bandit problems, explores non-asymptotic theoretical tools like concentration inequalities and minimax regret bounds, and compares frequentist and Bayesian algorithms for managing exploration-exploitation trade-offs. We also extend the discussion to $K$-armed contextual bandits and SCAB, examining their methodologies and regret analyses and discussing the relation between SCAB problems and functional data analysis. Finally, we highlight recent advances and ongoing challenges in the field.

What problem does this paper attempt to address?

This paper addresses the exploration-exploitation trade-off in the multi-armed bandit problem, that is, how to make sequential decisions under uncertainty. Specifically, the authors review the multi-armed bandit problem in AI from a statistical perspective and cover the following aspects:

1. **Basic Models and Assumptions**: Introduces the basic models and assumptions of the multi-armed bandit problem, including the Stochastic Multi-Armed Bandit (MAB) and the Stochastic Continuum-Armed Bandit (SCAB).
2. **Non-Asymptotic Theoretical Tools**: Discusses non-asymptotic theoretical tools for analyzing the bandit problem, such as concentration inequalities and minimax regret bounds. These tools are crucial for quantifying the uncertainty in reward estimation.
3. **Algorithm Comparison**: Compares frequentist and Bayesian algorithms for managing the trade-off between exploration and exploitation. Specifically, algorithms such as Explore-Then-Commit, Upper Confidence Bound (UCB), and Thompson Sampling are discussed.
4. **Extended Discussion**: Extends the discussion to $K$-armed contextual bandits and SCAB problems, analyzes their methodologies and regret bounds, and explores the relationship between the SCAB problem and functional data analysis.
5. **Latest Progress and Challenges**: Summarizes the latest progress in this field and points out the current challenges, especially applications to high-dimensional data and complex environments.

### Core Problem of the Paper

The core problem of the paper is to understand and optimize the exploration-exploitation trade-off in the multi-armed bandit problem through statistical methods, thereby improving decision-making efficiency and reducing regret. This involves not only specific algorithm design but also the understanding and handling of different types of bandit problems (such as structured and unstructured bandits).
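
The three algorithms named in the summary (Explore-Then-Commit, UCB, and Thompson Sampling) can be illustrated with a minimal Python sketch on a Bernoulli bandit. This is not the paper's code. The function names, the exploration bonus $\sqrt{2\log t / n_a}$ (the textbook UCB1 choice), the Beta(1, 1) prior, and the arm means used in the demo are all assumptions chosen for illustration; the paper may analyze different variants.

```python
import math
import random


def pull(mean, rng):
    """Draw a Bernoulli reward (0.0 or 1.0) with the given mean."""
    return 1.0 if rng.random() < mean else 0.0


def explore_then_commit(means, horizon, m, rng):
    """Pull each arm m times round-robin, then commit to the empirical best."""
    k = len(means)
    counts, sums, chosen = [0] * k, [0.0] * k, []
    committed = None
    for t in range(horizon):
        if t < m * k:
            arm = t % k  # exploration phase
        else:
            if committed is None:  # decide once, at the end of exploration
                committed = max(range(k), key=lambda a: sums[a] / counts[a])
            arm = committed
        sums[arm] += pull(means[arm], rng)
        counts[arm] += 1
        chosen.append(arm)
    return chosen


def ucb1(means, horizon, rng):
    """Play the arm with the largest empirical mean plus confidence bonus."""
    k = len(means)
    counts, sums, chosen = [0] * k, [0.0] * k, []
    for t in range(horizon):
        if t < k:
            arm = t  # initialisation: pull every arm once
        else:
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t + 1) / counts[a]),
            )
        sums[arm] += pull(means[arm], rng)
        counts[arm] += 1
        chosen.append(arm)
    return chosen


def thompson_sampling(means, horizon, rng):
    """Sample each arm's mean from its Beta posterior and play the argmax."""
    k = len(means)
    alpha, beta, chosen = [1] * k, [1] * k, []  # Beta(1, 1) priors (assumed)
    for _ in range(horizon):
        arm = max(range(k), key=lambda a: rng.betavariate(alpha[a], beta[a]))
        if pull(means[arm], rng) == 1.0:
            alpha[arm] += 1
        else:
            beta[arm] += 1
        chosen.append(arm)
    return chosen


def pseudo_regret(means, chosen):
    """Sum of gaps mu* - mu_a over the arms actually played in one run."""
    best = max(means)
    return sum(best - means[a] for a in chosen)


if __name__ == "__main__":
    demo_means = [0.2, 0.5, 0.8]  # hypothetical arm means
    demo_rng = random.Random(0)
    runs = {
        "ETC (m=50)": explore_then_commit(demo_means, 2000, 50, demo_rng),
        "UCB1": ucb1(demo_means, 2000, demo_rng),
        "Thompson": thompson_sampling(demo_means, 2000, demo_rng),
    }
    for name, chosen in runs.items():
        print(name, round(pseudo_regret(demo_means, chosen), 1))
```

The simulator knows the true means, so `pseudo_regret` can sum the gaps of the arms played. Its expectation over runs equals the regret that the paper's analysis bounds. A real learner would only observe the rewards.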
### Key Formulas and Concepts

Some of the key formulas involved in the paper include:

- **Regret Definition**: the regret of a policy $\pi$ on a bandit instance $v$ over $T$ rounds is
  \[ Reg_T(\pi, v) := T\mu^*(v)-\mathbb{E}\left[\sum_{t = 1}^{T}X_t\right] \]
  where $X_t$ is the reward received in round $t$ and $\mu^*(v)=\max_{a\in A}\mu_a(v)$ is the optimal mean reward.
- **Hoeffding Inequality**: for independent random variables $X_i \in [a_i, b_i]$,
  \[ P\left(\left|\sum_{i = 1}^{n}(X_i-\mathbb{E}[X_i])\right|\geq t\right)\leq 2\exp\left(-\frac{2t^2}{\sum_{i = 1}^{n}(b_i - a_i)^2}\right) \]
- **Concentration Inequality for Sub-Gaussian Distribution**: for a zero-mean $\sigma$-sub-Gaussian random variable $X$,
  \[ P(|X|\geq t)\leq 2e^{-t^2/(2\sigma^2)} \]

With these formulas and theoretical tools, the paper aims to give a comprehensive understanding of the multi-armed bandit problem and to point out directions and methods for future research.
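
As a rough numerical illustration of the Hoeffding inequality above, the following Python sketch evaluates the right-hand side of the bound, inverts it into a confidence radius for the mean of $n$ rewards in $[0, 1]$ (setting $2\exp(-2n\varepsilon^2)=\delta$ gives $\varepsilon=\sqrt{\log(2/\delta)/(2n)}$), and compares the bound with a Monte Carlo estimate of the tail probability. The function names and the Bernoulli setup are assumptions made for this example and do not come from the paper.

```python
import math
import random


def hoeffding_bound(t, ranges):
    """Right-hand side of Hoeffding's inequality for ranges [(a_i, b_i), ...]."""
    denom = sum((b - a) ** 2 for a, b in ranges)
    return 2.0 * math.exp(-2.0 * t * t / denom)


def hoeffding_radius(n, delta):
    """Half-width eps with P(|mean - mu| >= eps) <= delta for n draws in [0, 1]."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def empirical_tail(n, p, t, trials, rng):
    """Monte Carlo estimate of P(|sum_i (X_i - p)| >= t) for Bernoulli(p) draws."""
    hits = 0
    for _ in range(trials):
        total = sum(1 for _ in range(n) if rng.random() < p)
        if abs(total - n * p) >= t:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    n, p, t = 100, 0.5, 10  # hypothetical example values
    rng = random.Random(0)
    print("bound    :", hoeffding_bound(t, [(0, 1)] * n))
    print("empirical:", empirical_tail(n, p, t, 2000, rng))
    print("radius at delta=0.05:", hoeffding_radius(n, 0.05))
```

The bound is usually loose compared with the simulated frequency, since it holds for every distribution on the given range. A confidence radius of this kind is what UCB-type algorithms add to the empirical mean as an exploration bonus.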

Selective Reviews of Bandit Problems in AI via a Statistical View

Understanding the stochastic dynamics of sequential decision-making processes: A path-integral analysis of multi-armed bandits

A General Framework for Bandit Problems Beyond Cumulative Objectives

A Survey on Practical Applications of Multi-Armed and Contextual Bandits

Unified Models of Human Behavioral Agents in Bandits, Contextual Bandits and RL

A Survey of Risk-Aware Multi-Armed Bandits

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

The Bandit Whisperer: Communication Learning for Restless Bandits

Combinatorial Multivariant Multi-Armed Bandits with Applications to Episodic Reinforcement Learning and Beyond

Risk-Aware Multi-Armed Bandit Problem with Application to Portfolio Selection

Non-stationary Bandits with Habituation and Recovery Dynamics and Knapsack Constraints

Online Restless Multi-Armed Bandits with Long-Term Fairness Constraints

Provably Efficient Reinforcement Learning for Adversarial Restless Multi-Armed Bandits with Unknown Transitions and Bandit Feedback

Introduction to Multi-Armed Bandits

A Bayesian Approach to Learning Bandit Structure in Markov Decision Processes

Bayesian Reinforcement Learning: A Survey

Bandits with Concave Aggregated Reward

contextual: Evaluating Contextual Multi-Armed Bandit Problems in R

A Review of Reinforcement Learning in Financial Applications

Multi-Armed Bandits in Brain-Computer Interfaces

Forced Exploration in Bandit Problems