Abstract: The off-policy paradigm casts recommendation as a counterfactual decision-making task, allowing practitioners to unbiasedly estimate online metrics using offline data. This leads to effective evaluation metrics, as well as learning procedures that directly optimise online success. Nevertheless, the high variance that comes with unbiasedness is typically the crux that complicates practical applications. An important insight is that the difference between policy values can often be estimated with significantly reduced variance, if said policies have positive covariance. This allows us to formulate a pairwise off-policy estimation task: $\Delta\text{-}{\rm OPE}$. $\Delta\text{-}{\rm OPE}$ subsumes the common use-case of estimating improvements of a learnt policy over a production policy, using data collected by a stochastic logging policy. We introduce $\Delta\text{-}{\rm OPE}$ methods based on the widely used Inverse Propensity Scoring estimator and its extensions. Moreover, we characterise a variance-optimal additive control variate that further enhances efficiency. Simulated, offline, and online experiments show that our methods significantly improve performance for both evaluation and learning tasks.

What problem does this paper attempt to address?

The core problem this paper addresses is the high variance of Off-Policy Estimation (OPE). In recommender systems, OPE methods allow unbiased estimation of online metrics from offline data, but these unbiased estimates usually come with high variance. The resulting confidence intervals are often too wide to inform practical decisions.

To address this, the paper introduces a pairwise off-policy estimation task, Δ-OPE. The main idea is to estimate the difference between two policy values directly rather than each value on its own. When the two policies have positive covariance, the variance of the estimated difference is significantly reduced, which yields tighter confidence intervals. This improves the power of statistical tests (fewer Type II errors) and also improves the recommendation policy obtained in counterfactual learning scenarios.

Specifically, the paper makes the following contributions:

1. **Δ-OPE task**: Extend the common counterfactual estimation of a single policy value to the estimation of the difference between a pair of policy values.
2. **Methods based on Inverse Propensity Scoring (IPS) and its extensions**: Propose several estimators for Δ-OPE, including Δ-IPS, Δ-SNIPS, and Δβ-IPS.
3. **Variance-optimal control variate**: Derive a global additive control variate that minimises variance and further improves efficiency.

Through simulation experiments, offline experiments, and online A/B tests, the paper verifies the effectiveness of Δ-OPE methods for both evaluation and learning tasks.
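The role of positive covariance can be made explicit with the standard variance identity for a difference of two random variables. The notation $\hat{V}(\pi_t)$ and $\hat{V}(\pi_p)$ for estimates of the target and production policy values is assumed here for illustration:

\[ \operatorname{Var}\left(\hat{V}(\pi_t)-\hat{V}(\pi_p)\right)=\operatorname{Var}\left(\hat{V}(\pi_t)\right)+\operatorname{Var}\left(\hat{V}(\pi_p)\right)-2\operatorname{Cov}\left(\hat{V}(\pi_t), \hat{V}(\pi_p)\right) \]

If the two values were estimated from independent samples, the covariance term would be zero and the variances would simply add. When both estimates are computed on the same logged data and are positively correlated, the covariance term is subtracted, so the difference is estimated with lower variance than in the independent case.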
### Formula Summary

- Traditional IPS estimator:
  \[ \hat{V}_{\text{IPS}}(\pi_t, D)=\frac{1}{|D|} \sum_{(x, a, r) \in D} \frac{\pi_t(a|x)}{\pi_0(a|x)} r \]
- Δ-IPS estimator:
  \[ \hat{V}_{\Delta-\text{IPS}}(\pi_t, \pi_p, D)=\frac{1}{|D|} \sum_{(x, a, r) \in D}\left(\frac{\pi_t(a|x)-\pi_p(a|x)}{\pi_0(a|x)}\right) r \]
- Δ-SNIPS estimator:
  \[ \hat{V}_{\Delta-\text{SNIPS}}(\pi_t, \pi_p, D)=\frac{\sum_{(x, a, r) \in D} \frac{\pi_t(a|x)}{\pi_0(a|x)} r}{\sum_{(x, a, r) \in D} \frac{\pi_t(a|x)}{\pi_0(a|x)}}-\frac{\sum_{(x, a, r) \in D} \frac{\pi_p(a|x)}{\pi_0(a|x)} r}{\sum_{(x, a, r) \in D} \frac{\pi_p(a|x)}{\pi_0(a|x)}} \]
- Δβ-IPS estimator:
  \[ \hat{V}_{\Delta\beta-\text{IPS}}(\pi_t, \pi_p, D)=\frac{1}{|D|} \sum_{(x, a, r) \in D}\left(\frac{\pi_t(a|x)-\pi_p(a|x)}{\pi_0(a|x)}\right)(r - \beta) \]
  where the optimal baseline $\beta^*$ is:
  \[ \beta^*=\frac{\mathbb{E}_{a \sim \pi_0}\left[\left(\frac{\pi_t(a|x)-\pi_p(a|x)}{\pi_0(a|x)}\right)^2 r\right]}{\mathbb{E}_{a \sim \pi_0}\left[\left(\frac{\pi_t(a|x)-\pi_p(a|x)}{\pi_0(a|x)}\right)^2\right]} \]
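The estimators summarised above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the function names (`ips`, `delta_ips`, `snips`, `delta_snips`, `beta_star`, `delta_beta_ips`), the context-free four-action bandit, and all probabilities are hypothetical. The baseline is estimated as the sample ratio of squared-weight-weighted rewards to squared weights, which is what setting the derivative of the Δβ-IPS variance with respect to β to zero gives; plugging in a baseline estimated on the same data is a simplification made here for brevity.

```python
import numpy as np

def ips(p_pi, p_0, r):
    """IPS estimate of a single policy value from logged propensities."""
    return np.mean(p_pi / p_0 * r)

def delta_ips(p_t, p_p, p_0, r):
    """Delta-IPS: estimate of V(pi_t) - V(pi_p) with a single weight difference."""
    return np.mean((p_t - p_p) / p_0 * r)

def snips(p_pi, p_0, r):
    """Self-normalised IPS estimate of a single policy value."""
    w = p_pi / p_0
    return np.sum(w * r) / np.sum(w)

def delta_snips(p_t, p_p, p_0, r):
    """Delta-SNIPS: difference of the two self-normalised estimates."""
    return snips(p_t, p_0, r) - snips(p_p, p_0, r)

def beta_star(p_t, p_p, p_0, r):
    """Sample estimate of the baseline: sum(w^2 r) / sum(w^2).
    Undefined if pi_t and pi_p agree on every logged action (all w == 0)."""
    w = (p_t - p_p) / p_0
    return np.sum(w ** 2 * r) / np.sum(w ** 2)

def delta_beta_ips(p_t, p_p, p_0, r, beta):
    """Delta-beta-IPS: Delta-IPS with an additive baseline beta on the reward."""
    w = (p_t - p_p) / p_0
    return np.mean(w * (r - beta))

# Hypothetical synthetic setup (assumption): 4 actions, no context.
rng = np.random.default_rng(0)
pi_0 = np.array([0.25, 0.25, 0.25, 0.25])     # logging policy
pi_p = np.array([0.40, 0.30, 0.20, 0.10])     # production policy
pi_t = np.array([0.50, 0.30, 0.10, 0.10])     # target (learnt) policy
p_click = np.array([0.30, 0.20, 0.10, 0.05])  # true reward probabilities

a = rng.choice(4, size=10_000, p=pi_0)        # logged actions
r = rng.binomial(1, p_click[a]).astype(float) # logged binary rewards

# Propensities of the logged actions under each policy.
p_0, p_p, p_t = pi_0[a], pi_p[a], pi_t[a]

true_delta = np.sum((pi_t - pi_p) * p_click)
beta = beta_star(p_t, p_p, p_0, r)
print("true difference :", true_delta)
print("Delta-IPS       :", delta_ips(p_t, p_p, p_0, r))
print("Delta-SNIPS     :", delta_snips(p_t, p_p, p_0, r))
print("Delta-beta-IPS  :", delta_beta_ips(p_t, p_p, p_0, r, beta))
```

Note that Δ-IPS is algebraically identical to subtracting two IPS estimates computed on the same data; the benefit of the pairwise formulation shows up in the variance analysis and in the choice of baseline, not in a different point estimate for plain IPS.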

$\Delta\text{-}{\rm OPE}$: Off-Policy Estimation with Pairs of Policies

Design from Policies: Conservative Test-Time Adaptation for Offline Policy Optimization

Beyond Reward: Offline Preference-guided Policy Optimization

OPERA: Automatic Offline Policy Evaluation with Re-weighted Aggregates of Multiple Estimators

Off-Policy Evaluation via Off-Policy Classification

More Efficient Off-Policy Evaluation through Regularized Targeted Learning

Automated Off-Policy Estimator Selection via Supervised Learning

Off-policy evaluation beyond overlap: partial identification through smoothness

Off-Policy Evaluation Using Information Borrowing and Context-Based Switching

Off-Policy Evaluation in Doubly Inhomogeneous Environments

Doubly-Robust Off-Policy Evaluation with Estimated Logging Policy

Quantile Off-Policy Evaluation via Deep Conditional Generative Learning

Off-Policy Exploitability-Evaluation in Two-Player Zero-Sum Markov Games

Doubly Robust Distributionally Robust Off-Policy Evaluation and Learning

Primal-Dual Spectral Representation for Off-policy Evaluation

Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling

Offline Policy Evaluation in Large Action Spaces Via Outcome-Oriented Action Grouping

IntOPE: Off-Policy Evaluation in the Presence of Interference

Probabilistic Offline Policy Ranking with Approximate Bayesian Computation

Concept-driven Off Policy Evaluation

Distributional Off-policy Evaluation with Bellman Residual Minimization