Powerful A/B-Testing Metrics and Where to Find Them

Olivier Jeunen, Shubham Baweja, Neeti Pokharna, Aleksei Ustimenko

2024-07-30

Abstract: Online controlled experiments, colloquially known as A/B-tests, are the bread and butter of real-world recommender system evaluation. Typically, end-users are randomly assigned some system variant, and a plethora of metrics are then tracked, collected, and aggregated throughout the experiment. A North Star metric (e.g. long-term growth or revenue) is used to assess which system variant should be deemed superior. As a result, most collected metrics are supporting in nature, and serve to either (i) provide an understanding of how the experiment impacts user experience, or (ii) allow for confident decision-making when the North Star metric moves insignificantly (i.e. a false negative or type-II error). The latter is not straightforward: suppose a treatment variant leads to fewer but longer sessions, with more views but fewer engagements; should this be considered a positive or negative outcome? The question then becomes: how do we assess a supporting metric's utility when it comes to decision-making using A/B-testing? Online platforms typically run dozens of experiments at any given time. This provides a wealth of information about interventions and treatment effects that can be used to evaluate metrics' utility for online evaluation. We propose to collect this information and leverage it to quantify type-I, type-II, and type-III errors for the metrics of interest, alongside a distribution of measurements of their statistical power (e.g. $z$-scores and $p$-values). We present results and insights from building this pipeline at scale for two large-scale short-video platforms: ShareChat and Moj; leveraging hundreds of past experiments to find online metrics with high statistical power.

Information Retrieval, Applications

What problem does this paper attempt to address?

The paper addresses how to assess the utility of supporting metrics in online A/B testing, particularly for making decisions when the North Star metric shows no significant change. Specifically, it covers the following points:

1. **Role of Supporting Metrics**:
   - Supporting metrics are typically used to understand how an experiment affects user experience, or to support decisions when the North Star metric shows no significant change.
   - The paper points out that, in practice, many experiments fail to move the North Star metric significantly, which may be due to the low sensitivity of that metric.
2. **Challenges in Decision-Making**:
   - How should one judge whether changes in user behavior caused by a treatment variant (such as fewer but longer sessions, or more views but fewer engagements) are positive or negative?
   - The paper proposes to evaluate the utility of supporting metrics for decision-making by quantifying their statistical power (measured through, e.g., z-scores and p-values).
3. **Methods and Contributions**:
   - The paper proposes collecting and analyzing data from past A/B experiments to quantify the Type I, Type II, and Type III errors of supporting metrics.
   - This makes it possible to identify online metrics with high statistical power, improving the quality and efficiency of experimental decision-making.
4. **Empirical Results**:
   - Through experiments on two large-scale short-video platforms (ShareChat and Moj), the paper demonstrates how combining multiple supporting metrics can reduce Type II errors or reduce the required sample size.
   - The experimental results show that using these sets of metrics can significantly improve statistical power or reduce experimental costs.

In summary, the paper aims to systematically evaluate and optimize supporting metrics to improve the decision quality and experimental efficiency of A/B testing.
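As a rough illustration of the idea summarized above, the sketch below computes a z-score and two-sided p-value for one metric and then tallies error rates over a log of past experiments. It is not the authors' pipeline. The names `z_and_p`, `PastExperiment`, `true_sign`, and `error_rates` are hypothetical, and the error definitions are assumptions made for this example: a Type I error is a significant result on an A/A test, a Type II error is a non-significant result when the true direction is known, and a Type III error is a significant result with the wrong sign.

```python
import math
from dataclasses import dataclass


def z_and_p(mean_c, var_c, n_c, mean_t, var_t, n_t):
    """Two-sample z-test on the difference of means (treatment minus control)."""
    standard_error = math.sqrt(var_c / n_c + var_t / n_t)
    z = (mean_t - mean_c) / standard_error
    # Two-sided p-value under a standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


@dataclass
class PastExperiment:
    z: float        # z-score of the candidate metric in this experiment
    true_sign: int  # assumed label: 0 for an A/A test, +1 or -1 for a known outcome


def error_rates(experiments, z_crit=1.96):
    """Tally error rates for one metric; z_crit=1.96 is a two-sided 5% level."""
    aa_tests = [e for e in experiments if e.true_sign == 0]
    known = [e for e in experiments if e.true_sign != 0]

    # Type I: significant movement although nothing changed (A/A test).
    type_1 = sum(abs(e.z) > z_crit for e in aa_tests)
    # Type II: no significant movement although the true direction is known.
    type_2 = sum(abs(e.z) <= z_crit for e in known)
    # Type III: significant movement, but in the wrong direction.
    type_3 = sum(
        abs(e.z) > z_crit and (e.z > 0) != (e.true_sign > 0) for e in known
    )

    def rate(count, total):
        return count / total if total else float("nan")

    return {
        "type_I": rate(type_1, len(aa_tests)),
        "type_II": rate(type_2, len(known)),
        "type_III": rate(type_3, len(known)),
    }


# Usage with made-up numbers (not data from the paper).
z, p = z_and_p(mean_c=10.0, var_c=4.0, n_c=10_000,
               mean_t=10.1, var_t=4.0, n_t=10_000)
log = [
    PastExperiment(z=0.5, true_sign=0),
    PastExperiment(z=2.5, true_sign=0),
    PastExperiment(z=3.0, true_sign=1),
    PastExperiment(z=1.0, true_sign=1),
    PastExperiment(z=-2.2, true_sign=1),
    PastExperiment(z=-4.0, true_sign=-1),
]
rates = error_rates(log)
```

Under these assumptions, a metric is more useful for decision-making when all three rates are low, and comparing the rates across candidate metrics on the same experiment log is one way to rank them.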

Learning Metrics that Maximise Power for Accelerated A/B-Tests

Online Experimentation with Surrogate Metrics: Guidelines and a Case Study

Large-Scale Online Experimentation with Quantile Metrics

Variance Reduction in Ratio Metrics for Efficient Online Experiments

A/B Testing for Recommender Systems in a Two-sided Marketplace

How to Measure Your App: A Couple of Pitfalls and Remedies in Measuring App Performance in Online Controlled Experiments

Statistical Challenges in Online Controlled Experiments: A Review of A/B Testing Methodology

A Method for Measuring Network Effects of One-to-One Communication Features in Online A/B Tests

How A/B testing changes the dynamics of information spreading on a social network

A/B testing: A systematic literature review

Designing Experiments to Measure Incrementality on Facebook

Rapid and Scalable Bayesian AB Testing

Post Launch Evaluation of Policies in a High-Dimensional Setting

An Online Sequential Test for Qualitative Treatment Effects

All about sample-size calculations for A/B testing: Novel extensions and practical guide

Measuring e-Commerce Metric Changes in Online Experiments

Online Controlled Experiments for Personalised e-Commerce Strategies: Design, Challenges, and Pitfalls

Sequential Optimum Test with Multi-armed Bandits for Online Experimentation

Flexible Online Repeated Measures Experiment

Accelerated learning from recommender systems using multi-armed bandit