Powerful A/B-Testing Metrics and Where to Find Them

Olivier Jeunen, Shubham Baweja, Neeti Pokharna, Aleksei Ustimenko
2024-07-30
Abstract: Online controlled experiments, colloquially known as A/B-tests, are the bread and butter of real-world recommender system evaluation. Typically, end-users are randomly assigned some system variant, and a plethora of metrics are then tracked, collected, and aggregated throughout the experiment. A North Star metric (e.g. long-term growth or revenue) is used to assess which system variant should be deemed superior. As a result, most collected metrics are supporting in nature, and serve to either (i) provide an understanding of how the experiment impacts user experience, or (ii) allow for confident decision-making when the North Star metric moves insignificantly (i.e. a false negative or type-II error). The latter is not straightforward: suppose a treatment variant leads to fewer but longer sessions, with more views but fewer engagements; should this be considered a positive or negative outcome? The question then becomes: how do we assess a supporting metric's utility when it comes to decision-making using A/B-testing? Online platforms typically run dozens of experiments at any given time. This provides a wealth of information about interventions and treatment effects that can be used to evaluate metrics' utility for online evaluation. We propose to collect this information and leverage it to quantify type-I, type-II, and type-III errors for the metrics of interest, alongside a distribution of measurements of their statistical power (e.g. $z$-scores and $p$-values). We present results and insights from building this pipeline at scale for two large-scale short-video platforms: ShareChat and Moj; leveraging hundreds of past experiments to find online metrics with high statistical power.
Information Retrieval, Applications
What problem does this paper attempt to address?
The paper addresses how to assess the utility of supporting metrics in online A/B-testing, particularly for decisions that must be made when the North Star metric shows no statistically significant change. Specifically, it covers the following points:

1. **Role of Supporting Metrics**:
   - Supporting metrics are typically used either to understand how an experiment affects user experience, or to support a decision when the North Star metric does not move significantly.
   - The paper points out that, in practice, many experiments fail to move the North Star metric significantly, which may be due to the low sensitivity of that metric.
2. **Challenges in Decision-Making**:
   - When a treatment variant changes user behavior in mixed ways (for example, fewer but longer sessions, or more views but fewer engagements), it is unclear whether the outcome should count as positive or negative.
   - The paper proposes to evaluate a supporting metric's utility for decision-making by measuring its statistical power across past experiments (e.g. through distributions of z-scores and p-values).
3. **Methods and Contributions**:
   - The paper proposes collecting and analyzing data from past A/B experiments to quantify the type-I, type-II, and type-III errors of supporting metrics.
   - This makes it possible to identify online metrics with high statistical power, improving the quality and efficiency of experiment-based decisions.
4. **Empirical Results**:
   - Using experiments from two large-scale short-video platforms (ShareChat and Moj), the paper demonstrates how combining multiple supporting metrics can reduce type-II errors or reduce the required sample size.
   - The results show that using such sets of metrics can significantly improve statistical power or reduce experimental costs.

In summary, the paper aims to systematically evaluate and select supporting metrics in order to improve the decision quality and efficiency of A/B-testing.
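The idea of quantifying type-I, type-II, and type-III errors from past experiments can be illustrated with a minimal sketch. It is not the authors' pipeline: the `PastExperiment` record, the `error_rates` helper, the labelling scheme (`true_sign` of 0 for an A/A test, +1 or -1 for an experiment whose true outcome is assumed known), and the two-sided normal test at `alpha = 0.05` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import NormalDist


@dataclass
class PastExperiment:
    """Hypothetical record of one past experiment for a single metric."""
    z_score: float  # metric's z-score, treatment vs. control
    true_sign: int  # assumed label: 0 = A/A test, +1 = treatment better, -1 = worse


def two_sided_p_value(z: float) -> float:
    """Two-sided p-value of a z-score under a standard normal null."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def error_rates(experiments, alpha=0.05):
    """Empirical error rates of one metric over labelled past experiments.

    type_1: share of A/A tests where the metric is significant.
    type_2: share of known-outcome tests where the metric is NOT significant.
    type_3: share of known-outcome tests where the metric is significant
            but points in the wrong direction.
    """
    def significant(e):
        return two_sided_p_value(e.z_score) < alpha

    aa_tests = [e for e in experiments if e.true_sign == 0]
    known = [e for e in experiments if e.true_sign != 0]

    def rate(count, total):
        return count / total if total else float("nan")

    return {
        "type_1": rate(sum(significant(e) for e in aa_tests), len(aa_tests)),
        "type_2": rate(sum(not significant(e) for e in known), len(known)),
        "type_3": rate(
            sum(significant(e) and e.z_score * e.true_sign < 0 for e in known),
            len(known),
        ),
    }


# Made-up z-scores, purely to exercise the function.
history = [
    PastExperiment(0.5, 0),
    PastExperiment(2.5, 0),
    PastExperiment(3.0, +1),
    PastExperiment(1.0, +1),
    PastExperiment(-2.2, +1),
    PastExperiment(-2.8, -1),
]
rates = error_rates(history)
print(rates)
```

Running `error_rates` once per candidate metric over the same experiment history would give a simple table for comparing metrics. Under these assumptions, a metric with a low type-II rate and near-zero type-III rate would be the more useful supporting metric.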