Abstract: A/B tests, also known as randomized controlled experiments (RCTs), are the gold standard for evaluating the impact of new policies, products, or decisions. However, these tests can be costly in time and resources, and they can expose users, customers, or other test subjects (units) to inferior options. This paper explores practical considerations in applying methodologies inspired by "synthetic control" as an alternative to traditional A/B testing in settings with up to hundreds of millions of units, a scale that is common in modern applications such as e-commerce and ride-sharing platforms. The method is particularly valuable when the treatment affects only a subset of units and leaves many units unaffected. In these scenarios, synthetic control methods use data from the unaffected units to estimate counterfactual outcomes for the treated units. After the treatment is implemented, these estimates are compared with the actual outcomes to measure the treatment effect. A key challenge in creating accurate counterfactual outcomes is interpolation bias, a well-documented phenomenon that occurs when control units differ significantly from treated units. To address this, we propose a two-phase approach: first, nearest neighbor matching on unit covariates selects similar control units; then, supervised learning methods suited to high-dimensional data estimate the counterfactual outcomes. Tests on six large-scale experiments show that this approach improves the accuracy of the estimates. However, our analysis reveals that machine learning bias, which arises from methods that trade off bias for variance reduction, can impact results and affect conclusions about treatment effects. We document this bias in large-scale experimental settings and propose effective de-biasing techniques to address this challenge.
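The two-phase approach described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the supervised learner, so ridge regression is assumed here as one example of a regularized method, and every function name, parameter, and default (`nearest_controls`, `ridge_weights`, `estimate_effect`, `k`, `lam`) is hypothetical. It also assumes a single treated unit and Euclidean distance on covariates.

```python
import numpy as np


def nearest_controls(x_treated, X_controls, k):
    """Phase 1: indices of the k control units closest to the treated unit
    in covariate space (Euclidean distance, an assumption of this sketch)."""
    dists = np.linalg.norm(X_controls - x_treated, axis=1)
    return np.argsort(dists)[:k]


def ridge_weights(Y_pre_controls, y_pre_treated, lam):
    """Phase 2: ridge regression of the treated unit's pre-treatment outcomes
    on the matched controls' pre-treatment outcomes.

    Y_pre_controls has shape (T_pre, k); y_pre_treated has shape (T_pre,).
    Ridge is an assumed stand-in for the paper's supervised learner.
    """
    k = Y_pre_controls.shape[1]
    gram = Y_pre_controls.T @ Y_pre_controls + lam * np.eye(k)
    return np.linalg.solve(gram, Y_pre_controls.T @ y_pre_treated)


def estimate_effect(x_treated, X_controls, y_pre_treated, y_post_treated,
                    Y_pre_controls, Y_post_controls, k=20, lam=1.0):
    """Average post-treatment gap between actual and counterfactual outcomes."""
    idx = nearest_controls(x_treated, X_controls, k)
    w = ridge_weights(Y_pre_controls[:, idx], y_pre_treated, lam)
    # Counterfactual: what the treated unit would have looked like untreated,
    # predicted from the matched controls' post-treatment outcomes.
    counterfactual = Y_post_controls[:, idx] @ w
    return float(np.mean(y_post_treated - counterfactual))
```

The penalty `lam` shrinks the fitted weights toward zero, which is one concrete way a learner trades bias for lower variance; this sketch applies no de-biasing step.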

Short-lived High-volume Multi-A(rmed)/B(andits) Testing

Sequential Optimum Test with Multi-armed Bandits for Online Experimentation

Multi-Armed Bandits with Interference

Accelerated learning from recommender systems using multi-armed bandit

Multi-Armed Bandit Strategies for Non-Stationary Reward Distributions and Delayed Feedback Processes

Adaptive Sequential Experiments with Unknown Information Arrival Processes

Bayesian Online Multiple Testing: A Resource Allocation Approach

The Non-Bayesian Restless Multi-Armed Bandit: A Case of Near-Logarithmic Strict Regret

Speed Up the Cold-Start Learning in Two-Sided Bandits with Many Arms

The Non-Bayesian Restless Multi-Armed Bandit: a Case of Near-Logarithmic Regret

Post Launch Evaluation of Policies in a High-Dimensional Setting

Misalignment, Learning, and Ranking: Harnessing Users Limited Attention

A/B Testing and Best-arm Identification for Linear Bandits with Robustness to Non-stationarity

Low-rank Bandits with Latent Mixtures

Multi-Armed Bandits with Network Interference

Offline Planning and Online Learning Under Recovering Rewards

Multiarmed Bandits Problem Under the Mean-Variance Setting

Evaluating Online Bandit Exploration In Large-Scale Recommender System

Non-stationary Bandits with Habituation and Recovery Dynamics and Knapsack Constraints

Adapting multi-armed bandits policies to contextual bandits scenarios

Reward Maximization for Pure Exploration: Minimax Optimal Good Arm Identification for Nonparametric Multi-Armed Bandits