$\Delta\text{-}{\rm OPE}$: Off-Policy Estimation with Pairs of Policies

Olivier Jeunen, Aleksei Ustimenko
2024-09-16
Abstract: The off-policy paradigm casts recommendation as a counterfactual decision-making task, allowing practitioners to unbiasedly estimate online metrics using offline data. This leads to effective evaluation metrics, as well as learning procedures that directly optimise online success. Nevertheless, the high variance that comes with unbiasedness is typically the crux that complicates practical applications. An important insight is that the difference between policy values can often be estimated with significantly reduced variance, if said policies have positive covariance. This allows us to formulate a pairwise off-policy estimation task: $\Delta\text{-}{\rm OPE}$. $\Delta\text{-}{\rm OPE}$ subsumes the common use-case of estimating improvements of a learnt policy over a production policy, using data collected by a stochastic logging policy. We introduce $\Delta\text{-}{\rm OPE}$ methods based on the widely used Inverse Propensity Scoring estimator and its extensions. Moreover, we characterise a variance-optimal additive control variate that further enhances efficiency. Simulated, offline, and online experiments show that our methods significantly improve performance for both evaluation and learning tasks.
Machine Learning, Information Retrieval
What problem does this paper attempt to address?
The core problem this paper addresses is the high variance of Off-Policy Estimation (OPE). In recommender systems, OPE methods allow online metrics to be estimated without bias from offline data, but these unbiased estimates usually come with high variance. The resulting confidence intervals are often too wide to inform practical decisions.

To address this challenge, the paper introduces pairwise Off-Policy Estimation (Δ-OPE). The main idea is to estimate the difference between two policy values directly, rather than each value on its own. When the two policies have positive covariance, the difference can be estimated with significantly lower variance, which yields tighter confidence intervals. This improves the power of statistical tests (fewer Type II errors) and also leads to better recommendation policies in counterfactual learning scenarios.

Specifically, the paper makes the following contributions:

1. **Δ-OPE task**: Extends the common counterfactual estimation of a single policy value to the estimation of the difference between a pair of policy values.
2. **Methods based on Inverse Propensity Scoring (IPS) and its extensions**: Proposes multiple estimators for Δ-OPE, including Δ-IPS, Δ-SNIPS, and Δβ-IPS.
3. **Variance-optimal control variate**: Derives a global additive control variate that further improves efficiency.

Through simulated experiments, offline experiments, and online A/B tests, the paper verifies the effectiveness of Δ-OPE methods for evaluation and learning tasks and demonstrates their potential in practical applications.
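The variance argument can be made explicit with a standard identity. As a sketch, write \(\hat{V}_t\) and \(\hat{V}_p\) (notation assumed here for illustration) for estimates of the two policy values computed from the same logged data:

\[
\operatorname{Var}\big(\hat{V}_t - \hat{V}_p\big)
= \operatorname{Var}\big(\hat{V}_t\big) + \operatorname{Var}\big(\hat{V}_p\big)
- 2\,\operatorname{Cov}\big(\hat{V}_t, \hat{V}_p\big)
\]

When the covariance is positive, the variance of the difference is smaller than the sum of the two individual variances, which is what estimating the two values from independent samples would give.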
### Formula Summary

- Traditional IPS estimator:
\[ \hat{V}_{\text{IPS}}(\pi_t, D)=\frac{1}{|D|} \sum_{(x, a, r) \in D} \frac{\pi_t(a|x)}{\pi_0(a|x)} r \]
- Δ-IPS estimator:
\[ \hat{V}_{\Delta\text{-IPS}}(\pi_t, \pi_p, D)=\frac{1}{|D|} \sum_{(x, a, r) \in D}\left(\frac{\pi_t(a|x)-\pi_p(a|x)}{\pi_0(a|x)}\right) r \]
- Δ-SNIPS estimator:
\[ \hat{V}_{\Delta\text{-SNIPS}}(\pi_t, \pi_p, D)=\frac{\sum_{(x, a, r) \in D} \frac{\pi_t(a|x)}{\pi_0(a|x)} r}{\sum_{(x, a, r) \in D} \frac{\pi_t(a|x)}{\pi_0(a|x)}}-\frac{\sum_{(x, a, r) \in D} \frac{\pi_p(a|x)}{\pi_0(a|x)} r}{\sum_{(x, a, r) \in D} \frac{\pi_p(a|x)}{\pi_0(a|x)}} \]
- Δβ-IPS estimator:
\[ \hat{V}_{\Delta\beta\text{-IPS}}(\pi_t, \pi_p, D)=\frac{1}{|D|} \sum_{(x, a, r) \in D}\left(\frac{\pi_t(a|x)-\pi_p(a|x)}{\pi_0(a|x)}\right)(r - \beta) \]
where the optimal baseline \(\beta^*\) is:
\[ \beta^*=\frac{\mathbb{E}_{a \sim \pi_0}\left[\left(\frac{\pi_t(a|x)-\pi_p(a|x)}{\pi_0(a|x)}\right)^2 r\right]}{\mathbb{E}_{a \sim \pi_0}\left[\left(\frac{\pi_t(a|x)-\pi_p(a|x)}{\pi_0(a|x)}\right)^2\right]} \]
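The estimators above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the record layout `(x, a, r, p0)`, the policy interface `pi(a, x)`, the helper names, and the two-action toy environment are all hypothetical. The baseline is computed as a plug-in ratio of sample averages of \(w^2 r\) and \(w^2\), with \(w = (\pi_t - \pi_p)/\pi_0\). Estimating it from the same data used for the final estimate is a simplification made here for brevity.

```python
import random

# A logged record is (x, a, r, p0), with p0 = pi_0(a|x) the logging propensity.
# Policies are functions pi(a, x) -> probability; this interface is hypothetical.

def ips(data, pi):
    """IPS estimate of a single policy's value."""
    return sum(pi(a, x) / p0 * r for x, a, r, p0 in data) / len(data)

def snips(data, pi):
    """Self-normalised IPS estimate of a single policy's value."""
    num = sum(pi(a, x) / p0 * r for x, a, r, p0 in data)
    den = sum(pi(a, x) / p0 for x, a, r, p0 in data)
    return num / den

def delta_ips(data, pi_t, pi_p, beta=0.0):
    """Delta-IPS; a non-zero beta gives the Delta-beta-IPS variant."""
    total = sum((pi_t(a, x) - pi_p(a, x)) / p0 * (r - beta)
                for x, a, r, p0 in data)
    return total / len(data)

def delta_snips(data, pi_t, pi_p):
    """Difference of the two self-normalised estimates."""
    return snips(data, pi_t) - snips(data, pi_p)

def estimate_beta(data, pi_t, pi_p):
    """Plug-in baseline: sample average of w^2 * r over sample average of w^2."""
    sq = [((pi_t(a, x) - pi_p(a, x)) / p0) ** 2 for x, a, r, p0 in data]
    return sum(s * rec[2] for s, rec in zip(sq, data)) / sum(sq)

# Toy setting (all numbers invented): two actions, uniform logging policy.
def pi_t(a, x):
    return 0.8 if a == 0 else 0.2

def pi_p(a, x):
    return 0.5

def simulate(n, rng):
    """Draw n logged records; action 0 pays off with prob. 0.6, action 1 with 0.4."""
    data = []
    for _ in range(n):
        a = rng.randrange(2)
        r = 1.0 if rng.random() < (0.6 if a == 0 else 0.4) else 0.0
        data.append((None, a, r, 0.5))
    return data

def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def compare_variances(reps=300, n=200, seed=0):
    """Variance of the paired estimate vs. a difference of independent IPS estimates."""
    rng = random.Random(seed)
    paired, independent = [], []
    for _ in range(reps):
        d1, d2 = simulate(n, rng), simulate(n, rng)
        paired.append(delta_ips(d1, pi_t, pi_p))
        independent.append(ips(d1, pi_t) - ips(d2, pi_p))
    return sample_variance(paired), sample_variance(independent)

# Four hand-written records, small enough to check by hand.
tiny = [(None, 0, 1.0, 0.5), (None, 1, 0.0, 0.5),
        (None, 0, 1.0, 0.5), (None, 1, 1.0, 0.5)]
print(delta_ips(tiny, pi_t, pi_p), delta_snips(tiny, pi_t, pi_p))
print(estimate_beta(tiny, pi_t, pi_p))
print(compare_variances())
```

On the four-record toy log, Δ-IPS and Δ-SNIPS both come out at 0.15 and the plug-in baseline at 0.75, up to floating-point rounding. `compare_variances` contrasts the paired estimate with a difference of two IPS estimates computed on independent logs; in this toy environment the paired variant should show the smaller empirical variance.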