Abstract: This paper tackles the risk-averse multi-armed bandit problem when incurred losses are non-stationary. The conditional value-at-risk (CVaR) is used as the objective function. Two estimation methods are proposed for this objective function in the presence of non-stationary losses, one relying on a weighted empirical distribution of losses and another on the dual representation of the CVaR. Such estimates can then be embedded into classic arm selection methods such as epsilon-greedy policies. Simulation experiments assess the performance of the arm selection algorithms based on the two novel estimation approaches, and such policies are shown to outperform naive benchmarks not taking non-stationarity into account.
### What problem does this paper attempt to solve?
This paper addresses **risk aversion in non-stationary multi-armed bandit (MAB) problems**. Specifically, it studies how to select the arm that minimizes the Conditional Value-at-Risk (CVaR) when the loss distribution changes over time. CVaR at level \(\alpha\) averages the losses in the worst \(1-\alpha\) tail of the distribution, which makes it suitable for settings where a single unfavorable outcome can have serious consequences.
#### Main challenges
1. **Non-stationarity**: Traditional MAB formulations assume that the loss distribution of each arm is fixed, but in practical applications it may change over time. Methods that rely on stationarity then no longer apply directly.
2. **Risk aversion**: Traditional MAB formulations usually maximize expected return, whereas this paper minimizes risk, in particular tail risk as measured by CVaR, because decisions in some fields are highly sensitive to risk (e.g., medical trials and financial investments).
#### Solutions
To address these challenges, the paper proposes two new CVaR estimation methods:
1. **Estimation based on a weighted empirical distribution**: CVaR is estimated from a weighted empirical distribution of past losses that gives greater weight to the most recent observations, so that the estimate adapts to changes in the loss distribution.
2. **Recursive estimation based on the dual representation of CVaR**: Combining the dual representation of CVaR with a recursive update formula allows CVaR to be estimated online, without storing the full history of losses.
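The abstract notes that either estimate can be embedded into a classic arm selection rule such as an epsilon-greedy policy. The following is a minimal sketch of that selection step only, not the paper's implementation; the function name `epsilon_greedy_arm` and its parameters are hypothetical, and the per-arm CVaR estimates are assumed to be supplied by one of the two estimators above.

```python
import numpy as np

def epsilon_greedy_arm(cvar_estimates, epsilon, rng):
    """Choose an arm from per-arm CVaR estimates (hypothetical helper).

    With probability epsilon, explore a uniformly random arm;
    otherwise exploit the arm with the lowest estimated CVaR,
    since losses (not rewards) are being minimized.
    """
    n_arms = len(cvar_estimates)
    if rng.random() < epsilon:
        return int(rng.integers(n_arms))
    return int(np.argmin(cvar_estimates))

rng = np.random.default_rng(0)
arm = epsilon_greedy_arm([3.2, 1.7, 2.5], epsilon=0.1, rng=rng)
```

After the chosen arm is played, only that arm's CVaR estimate would be updated with the observed loss.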
#### Experimental verification
The paper evaluates the two methods through simulation experiments. The results show that, compared with the traditional sample mean estimation method, the two proposed methods handle non-stationary losses better: they identify the least risky arm more accurately and reduce cumulative regret.
#### Formula summary
- Definition of CVaR:
\[
\text{CVaR}_\alpha(Z)=\frac{1}{1 - \alpha}\int_\alpha^1 q_u(Z)\,du
\]
where \(q_\alpha(Z)=\inf\{z\in\mathbb{R}:F_Z(z)\geq\alpha\}\), and \(F_Z(z)\) is the cumulative distribution function (CDF) of the loss random variable \(Z\).
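As an illustration of this definition in the stationary, unweighted case, the integral can be evaluated exactly for the empirical distribution of a sample, whose quantile function is a step function. This is a sketch for intuition, not one of the paper's estimators; the function name `empirical_cvar` is hypothetical.

```python
import numpy as np

def empirical_cvar(losses, alpha):
    """Average the empirical quantile function over (alpha, 1].

    For a sorted sample z[0] <= ... <= z[n-1], the empirical quantile
    q_u equals z[k] for u in (k/n, (k+1)/n]. Requires 0 <= alpha < 1.
    """
    z = np.sort(np.asarray(losses, dtype=float))
    n = z.size
    edges = np.arange(n + 1) / n
    # Length of the overlap of each interval (k/n, (k+1)/n] with (alpha, 1].
    overlap = np.clip(edges[1:], alpha, 1.0) - np.clip(edges[:-1], alpha, 1.0)
    return float(np.dot(z, overlap) / (1.0 - alpha))

# Losses 1..10 at alpha = 0.8: the worst 20% are 9 and 10.
cvar = empirical_cvar(range(1, 11), alpha=0.8)
```

With \(\alpha = 0\) the same function reduces to the sample mean, which is the risk-neutral special case.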
- CVaR estimate based on the weighted empirical distribution:
\[
\hat{\text{CVaR}}_{\alpha,n}^{\text{(weight)}}(Z_{n + 1})=\frac{\sum_{i = 1}^n w_i^{(n+1)}Z_i\mathbf{1}\{Z_i\geq\tilde{q}_n^\alpha(Z_{n+1})\}}{\sum_{i = 1}^n w_i^{(n+1)}\mathbf{1}\{Z_i\geq\tilde{q}_n^\alpha(Z_{n+1})\}}
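\]
where \(w_i^{(n+1)}\) are the observation weights (larger for more recent losses) and \(\tilde{q}_n^\alpha(Z_{n+1})\) is the corresponding \(\alpha\)-quantile estimate.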
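The weighted estimate can be sketched as follows. The summary does not specify the weighting scheme, so the exponential weights \(w_i \propto \gamma^{\,n-i}\) and the parameter `gamma` are assumptions made for illustration, as is the choice of the weighted empirical \(\alpha\)-quantile for \(\tilde{q}_n^\alpha\); the function name `weighted_cvar` is hypothetical.

```python
import numpy as np

def weighted_cvar(losses, alpha, gamma=0.9):
    """Weighted tail average of losses at or above a weighted quantile.

    Assumption: exponentially decaying weights, so the most recent
    loss (last element) gets the largest weight. gamma = 1 gives
    equal weights.
    """
    z = np.asarray(losses, dtype=float)
    n = z.size
    w = gamma ** np.arange(n - 1, -1, -1)
    w = w / w.sum()
    # Weighted empirical alpha-quantile: the smallest observed loss whose
    # cumulative weight reaches alpha.
    order = np.argsort(z)
    cum = np.cumsum(w[order])
    idx = min(int(np.searchsorted(cum, alpha, side="left")), n - 1)
    q = z[order][idx]
    # Ratio from the formula: weighted mean of the losses with Z_i >= q.
    tail = z >= q
    return float(np.dot(w[tail], z[tail]) / w[tail].sum())

# An arm whose losses dropped from 10 to 1 halfway through.
history = [10.0] * 50 + [1.0] * 50
recent_view = weighted_cvar(history, alpha=0.9, gamma=0.5)
equal_view = weighted_cvar(history, alpha=0.9, gamma=1.0)
```

In this example the equally weighted estimate stays at the old loss level, while the recency-weighted one tracks the new level, which is the behaviour the weighting is meant to provide.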
- Recursive update formula for the dual-representation estimate:
\[
E_{n+1,c,\alpha}=E_{n,c,\alpha}+\lambda\left(f_{c,\alpha}^{\text{CVaR}}(Z_n)-E_{n,c,\alpha}\right)
\]
where \(f_{c,\alpha}^{\text{CVaR}}(z)=c+\frac{1}{1-\alpha}(z - c)\mathbf{1}\{z > c\}\) and \(\lambda\) is the step size of the update.
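The dual representation expresses CVaR as a minimum over thresholds, \(\text{CVaR}_\alpha(Z)=\min_{c}\mathbb{E}\left[f_{c,\alpha}^{\text{CVaR}}(Z)\right]\), so one way to use the recursion is to track \(E_{n,c,\alpha}\) for several values of \(c\) and take the smallest. The sketch below does this on a finite grid; the grid, the initial value, and the class name `RecursiveCVaR` are assumptions made for illustration rather than details taken from the paper.

```python
import numpy as np

class RecursiveCVaR:
    """Online CVaR estimate for one arm via the recursive update (sketch)."""

    def __init__(self, c_grid, alpha, lam, init=0.0):
        self.c = np.asarray(c_grid, dtype=float)  # assumed candidate thresholds
        self.alpha = alpha
        self.lam = lam  # step size lambda in (0, 1]
        self.E = np.full(self.c.shape, float(init))

    def update(self, z):
        # f_c(z) = c + (z - c) * 1{z > c} / (1 - alpha), for every c at once.
        f = self.c + np.maximum(z - self.c, 0.0) / (1.0 - self.alpha)
        # E_{n+1} = E_n + lambda * (f_c(z) - E_n); no loss history is stored.
        self.E += self.lam * (f - self.E)

    def estimate(self):
        # Dual representation: minimize the tracked expectations over c.
        return float(self.E.min())

est = RecursiveCVaR(c_grid=np.arange(0.0, 5.0, 1.0), alpha=0.9, lam=0.1)
for _ in range(500):
    est.update(2.0)  # a constant loss of 2 has CVaR equal to 2
value = est.estimate()
```

A constant step size makes the tracked values an exponentially weighted average of past \(f_{c,\alpha}^{\text{CVaR}}\) values, so older losses are gradually forgotten; the accuracy of the minimum depends on how fine the assumed grid is.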
With these two estimators, the paper addresses the risk-averse multi-armed bandit problem in non-stationary environments and offers approaches that can be combined with standard arm selection policies in practical applications.