Abstract: This paper studies bandit problems where an agent has access to offline data that might be utilized to potentially improve the estimation of each arm's reward distribution. A major obstacle in this setting is the existence of compound biases from the observational data. Ignoring these biases and blindly fitting a model with the biased data could even negatively affect the online learning phase. In this work, we formulate this problem from a causal perspective. First, we categorize the biases into confounding bias and selection bias based on the causal structure they imply. Next, we extract the causal bound for each arm that is robust towards compound biases from biased observational data. The derived bounds contain the ground truth mean reward and can effectively guide the bandit agent to learn a nearly-optimal decision policy. We also conduct regret analysis in both contextual and non-contextual bandit settings and show that prior causal bounds could help consistently reduce the asymptotic regret.
What problem does this paper attempt to address?
The paper addresses how to improve multi-armed bandit algorithms when the available offline data are biased. Specifically, it asks how such data can still be used to estimate each arm's reward distribution more accurately when both confounding bias and selection bias are present. If these biases are ignored and a model is fitted blindly to the biased data, the online learning phase can actually perform worse. The paper therefore treats the problem from a causal inference perspective and proposes new methods for handling both kinds of bias.
### Main Contributions
1. **Derivation of Causal Bounds**: Based on the causal model, the paper derives bounds on the conditional causal effect that remain valid under confounding bias and selection bias. These bounds contain the true mean reward and can guide the bandit algorithm toward a near-optimal decision policy.
2. **Utilization of Causal Bounds**: A new framework is proposed in which the prior causal bounds obtained from the biased offline data guide arm selection in the bandit algorithm. This reduces the number of times sub-optimal arms are explored and lowers the cumulative regret.
3. **Algorithm Implementation**: Two enhanced bandit algorithms are developed: a contextual one (LinUCB-PCB) and a non-contextual one (UCB-PCB). With prior causal bounds, these algorithms are shown to have lower regret than their non-causal counterparts under mild conditions. An empirical evaluation demonstrates their effectiveness in specific settings.
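The idea behind the non-contextual variant can be illustrated with a minimal sketch. It assumes the prior bounds are used in two ways: arms whose upper bound falls below the best lower bound are discarded, and the usual UCB1 index is truncated at each arm's prior upper bound. The function names, the Bernoulli rewards, and the UCB1 exploration term are illustrative assumptions, not the paper's exact UCB-PCB algorithm.

```python
import math
import random


def clipped_ucb_index(mean, pulls, t, prior_upper):
    """UCB1 index truncated at a prior upper bound (illustrative assumption)."""
    if pulls == 0:
        # Unexplored arm: the prior upper bound is the only information available.
        return prior_upper
    ucb = mean + math.sqrt(2.0 * math.log(t) / pulls)
    return min(ucb, prior_upper)


def run_ucb_with_prior_bounds(true_means, bounds, horizon, seed=0):
    """Simulate a Bernoulli bandit; bounds[a] = (lower, upper) for arm a."""
    rng = random.Random(seed)
    # An arm whose upper bound is below the best lower bound cannot be optimal.
    best_lower = max(lo for lo, _ in bounds)
    active = [a for a, (_, hi) in enumerate(bounds) if hi >= best_lower]

    pulls = [0] * len(true_means)
    sums = [0.0] * len(true_means)
    for t in range(1, horizon + 1):
        def index(a):
            mean = sums[a] / pulls[a] if pulls[a] else 0.0
            return clipped_ucb_index(mean, pulls[a], t, bounds[a][1])

        arm = max(active, key=index)
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward
    return pulls


# Hypothetical example: arm 0 is ruled out by its bounds before any pull.
example_bounds = [(0.1, 0.3), (0.5, 0.9), (0.4, 0.8)]
example_pulls = run_ucb_with_prior_bounds([0.2, 0.7, 0.6], example_bounds, 200)
```

In this sketch a discarded arm is never pulled, so it contributes no regret at all, while the truncation keeps the index of a weak but not excluded arm from staying inflated early on.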
### Solutions
The paper solves the above problems through the following steps:
- **Causal Model and Bias Classification**: Using Pearl's structural causal model (SCM), the biases are classified as confounding bias or selection bias according to the causal structure they imply, and the impact of each is analyzed.
- **Causal Bound Calculation**: Causal bounds for each arm are computed separately with two methods, c-component decomposition and alternative interventions, and the tighter bound is taken as the final result.
- **Online Learning Optimization**: The causal bounds are then applied during online learning. Adjusting the upper and lower bounds used in arm selection reduces the number of times sub-optimal arms are explored and lowers the cumulative regret.
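The "tighter bound" step above can be sketched as follows. The sketch assumes that "tighter" means taking the larger of the two lower bounds and the smaller of the two upper bounds; since each interval contains the true value, so does their intersection. The function name and the numbers are hypothetical.

```python
def tighter_bound(bound_a, bound_b):
    """Intersect two valid (lower, upper) intervals for the same quantity."""
    lower = max(bound_a[0], bound_b[0])
    upper = min(bound_a[1], bound_b[1])
    if lower > upper:
        # Two valid bounds must overlap; an empty intersection signals an
        # estimation problem in at least one of them.
        raise ValueError("bounds do not overlap")
    return lower, upper


# Hypothetical bounds for one arm from the two derivations.
c_component_bound = (0.2, 0.7)
alt_intervention_bound = (0.3, 0.9)
final_bound = tighter_bound(c_component_bound, alt_intervention_bound)
```

Here `final_bound` is `(0.3, 0.7)`, which is narrower than either input interval.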
### Formula Examples
- **Conditional Causal Effect**:
\[
\mu_{a,c} = E[Y \mid \text{do}(X = x_a), c]
\]
- **Causal Bounds**:
\[
L_q = \frac{\sum_{D \setminus \{Y, C\}} \prod_{i = 1}^l LQ[D_i]}{P(c)}
\]
\[
U_q = \frac{\sum_{D \setminus \{Y, C\}} \prod_{i = 1}^l UQ[D_i]}{P(c)}
\]
- **Alternative Intervention Bounds**:
\[
L_w = \max_{W \in D} \min_{w^* \in W} \frac{P_{x,w^*}(y, c)}{P(c)}
\]
\[
U_w = \min_{W \in D} \max_{w^* \in W} \frac{P_{x,w^*}(y, c)}{P(c)}
\]
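The max-min structure of the alternative intervention bounds can be made concrete with a short numeric sketch. It assumes the quantities $P_{x,w^*}(y, c)$ have already been estimated for every candidate variable $W$ and every value $w^*$. The dictionary layout, the function name, and all numbers are made up for illustration.

```python
def alternative_intervention_bounds(joint_by_w, p_c):
    """Compute (L_w, U_w) from pre-estimated interventional quantities.

    joint_by_w maps each candidate variable W to a dict {w*: P_{x,w*}(y, c)}.
    p_c is P(c), the probability of the context.
    """
    if not 0.0 < p_c <= 1.0:
        raise ValueError("P(c) must lie in (0, 1]")
    # Lower bound: for each W take the smallest value over w*, then the best W.
    lower = max(min(vals.values()) for vals in joint_by_w.values()) / p_c
    # Upper bound: for each W take the largest value over w*, then the best W.
    upper = min(max(vals.values()) for vals in joint_by_w.values()) / p_c
    return lower, upper


# Hypothetical estimates for two candidate variables with binary values.
estimates = {
    "W1": {0: 0.10, 1: 0.20},
    "W2": {0: 0.15, 1: 0.25},
}
l_w, u_w = alternative_intervention_bounds(estimates, p_c=0.5)
```

With these numbers the lower bound comes from `W2` (0.15 / 0.5 = 0.3) and the upper bound from `W1` (0.20 / 0.5 = 0.4), showing that different candidate variables can determine the two ends of the interval.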
Together, these methods give a systematic way to improve multi-armed bandit algorithms when the offline data carry compound biases.