Abstract: This paper tackles the risk averse multi-armed bandits problem when incurred losses are non-stationary. The conditional value-at-risk (CVaR) is used as the objective function. Two estimation methods are proposed for this objective function in the presence of non-stationary losses, one relying on a weighted empirical distribution of losses and another on the dual representation of the CVaR. Such estimates can then be embedded into classic arm selection methods such as epsilon-greedy policies. Simulation experiments assess the performance of the arm selection algorithms based on the two novel estimation approaches, and such policies are shown to outperform naive benchmarks not taking non-stationarity into account.

What problem does this paper attempt to address?

This paper addresses the **risk-averse, non-stationary multi-armed bandit (MAB) problem**. Specifically, it studies how to select the arm that minimizes the Conditional Value-at-Risk (CVaR) when the loss distributions change over time. CVaR is a risk measure of extreme losses, and it is especially suitable when a single unfavorable outcome can have serious consequences.

#### Main challenges

1. **Non-stationarity**: Traditional MAB formulations assume that the loss distribution of each arm is fixed, but in practice it may change over time, so methods that rely on stationarity no longer apply.
2. **Risk aversion**: Traditional MAB formulations usually maximize expected return, whereas this paper minimizes risk, in particular tail risk as measured by CVaR, because decisions in some fields are very sensitive to risk (e.g., medical trials, financial investments).

#### Solutions

The paper proposes two new CVaR estimation methods:

1. **Weighted empirical distribution**: CVaR is estimated by giving greater weight to the most recent observations, so that the estimate adapts to changes in the loss distribution.
2. **Recursive estimation based on the dual representation of CVaR**: Combining the dual representation of CVaR with a recursive update formula allows CVaR to be estimated efficiently online, without storing all historical losses.

#### Experimental verification

The paper evaluates both methods through simulation experiments. The results show that, compared with the traditional sample-mean estimation method, the two proposed methods handle non-stationary losses better, identify the least risky arm more accurately, and reduce cumulative regret.

#### Formula summary

- Definition of CVaR:

  \[ \text{CVaR}_\alpha(Z)=\frac{1}{1-\alpha}\int_\alpha^1 q_u(Z)\,du \]

  where \(q_\alpha(Z)=\inf\{z\in\mathbb{R}:F_Z(z)\geq\alpha\}\) and \(F_Z(z)\) is the cumulative distribution function (CDF) of the loss random variable \(Z\).

- CVaR estimate from the weighted empirical distribution:

  \[ \widehat{\text{CVaR}}_{\alpha,n}^{\text{(weight)}}(Z_{n+1})=\frac{\sum_{i=1}^n w_i^{(n+1)}Z_i\mathbf{1}\{Z_i\geq\tilde{q}_n^\alpha(Z_{n+1})\}}{\sum_{i=1}^n w_i^{(n+1)}\mathbf{1}\{Z_i\geq\tilde{q}_n^\alpha(Z_{n+1})\}} \]

- Recursive update formula:

  \[ E_{n+1,c,\alpha}=E_{n,c,\alpha}+\lambda\left(f_{c,\alpha}^{\text{CVaR}}(Z_n)-E_{n,c,\alpha}\right) \]

  where \(f_{c,\alpha}^{\text{CVaR}}(z)=c+\frac{1}{1-\alpha}(z-c)\mathbf{1}\{z>c\}\).

With these methods, the paper addresses the risk-averse multi-armed bandit problem in non-stationary environments, and in its simulations the resulting policies outperform benchmarks that ignore non-stationarity.
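The two estimators in the formula summary can be illustrated with a minimal Python sketch. It is not the paper's implementation, and several pieces are assumptions made only for illustration: the exponential recency weights with discount `gamma`, the use of the weighted empirical CDF to obtain the quantile \(\tilde{q}\), the zero initial value of the recursive estimates, and the step of minimizing over a finite grid of candidate thresholds `c_grid`. The names `weighted_cvar` and `RecursiveCVaR` are hypothetical.

```python
import numpy as np


def weighted_cvar(losses, alpha, gamma=0.95):
    """Weighted-empirical CVaR estimate for one arm.

    Assumption: weights decay exponentially with age, w_i ~ gamma**(n - i),
    so the newest loss gets the largest weight. The summary does not say
    which weighting scheme the paper uses.
    """
    z = np.asarray(losses, dtype=float)
    n = z.size
    w = gamma ** np.arange(n - 1, -1, -1)
    w = w / w.sum()

    # Weighted quantile: smallest loss whose weighted CDF reaches alpha.
    order = np.argsort(z)
    cdf = np.cumsum(w[order])
    idx = min(int(np.searchsorted(cdf, alpha, side="left")), n - 1)
    q = z[order][idx]

    # Weighted average of the losses at or above the quantile.
    tail = z >= q
    return float(np.sum(w[tail] * z[tail]) / np.sum(w[tail]))


class RecursiveCVaR:
    """Recursive estimate E <- E + lam * (f_c(z) - E), one value per c.

    Assumptions: the candidate thresholds c come from a fixed grid, the
    estimates start at zero, and the CVaR estimate is the minimum over
    the grid.
    """

    def __init__(self, c_grid, alpha, lam=0.05):
        self.c = np.asarray(c_grid, dtype=float)
        self.alpha = alpha
        self.lam = lam
        self.E = np.zeros_like(self.c)

    def update(self, z):
        # f_c(z) = c + (z - c) * 1{z > c} / (1 - alpha), for every c at once.
        f = self.c + np.maximum(z - self.c, 0.0) / (1.0 - self.alpha)
        self.E += self.lam * (f - self.E)

    def estimate(self):
        return float(self.E.min())
```

In a bandit loop, one such estimator would be kept per arm, and an epsilon-greedy policy would pick the arm with the smallest current estimate with probability 1 - epsilon and a uniformly random arm otherwise. The recursive variant stores only one number per grid point, which is what removes the need to keep the full loss history.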

Risk averse non-stationary multi-armed bandits
