Abstract: A/B testing, also referred to as online controlled experimentation or continuous experimentation, is a form of hypothesis testing in which two variants of a piece of software are compared in the field from an end user's point of view. A/B testing is widely used in practice to enable data-driven decision making in software development. While a few studies have explored different facets of research on A/B testing, no comprehensive study has been conducted on the state of the art in A/B testing. Such a study is crucial to provide a systematic overview of the field and to drive future research forward. To address this gap, this paper reports the results of a systematic literature review that analyzed primary studies. The research questions focused on the subject of A/B testing, how A/B tests are designed and executed, what roles stakeholders have in this process, and the open challenges in the area. Analysis of the extracted data shows that the main targets of A/B testing are algorithms, visual elements, and workflow and processes. Single classic A/B tests are the dominant type of test, primarily based on hypothesis tests. Stakeholders have three main roles in the design of A/B tests: concept designer, experiment architect, and setup technician. The primary types of data collected during the execution of A/B tests are product/system data, user-centric data, and spatio-temporal data. The dominant uses of the test results are feature selection, feature rollout, continued feature development, and subsequent A/B test design. Stakeholders have two main roles during A/B test execution: experiment coordinator and experiment assessor. The main reported open problems relate to the enhancement of proposed approaches and their usability.
From our study we derived three interesting lines for future research: strengthening the adoption of statistical methods in A/B testing, improving the process of A/B testing, and enhancing the automation of A/B testing.
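
The abstract notes that classic A/B tests are primarily based on hypothesis tests. As a rough illustration of what such a test can look like, the sketch below applies a pooled two-proportion z-test to conversion counts from two variants. The function name, the choice of a z-test, and the example counts are assumptions made for illustration; they are not taken from the reviewed studies.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided pooled z-test for a difference in conversion rates.

    conv_a, conv_b: number of converting users in variants A and B.
    n_a, n_b: number of users exposed to variants A and B.
    Returns (z, p_value). Hypothetical helper, for illustration only.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are degenerate")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: 200/2000 conversions for A, 260/2000 for B.
z, p = two_proportion_z_test(200, 2000, 260, 2000)
print(f"z = {z:.3f}, p = {p:.4f}")
```

A small p-value here would suggest that the difference between the variants is unlikely under the null hypothesis. In practice the significance level and sample size would be fixed before the experiment starts.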

Risk-aware product decisions in A/B tests with multiple metrics

Learning Metrics that Maximise Power for Accelerated A/B-Tests

Powerful A/B-Testing Metrics and Where to Find Them

Rapid and Scalable Bayesian AB Testing

Test Where Decisions Matter: Importance-driven Testing for Deep Reinforcement Learning

Ranking by Lifts: A Cost-Benefit Approach to Large-Scale A/B Tests

Equivalence Test in Multi-dimensional Space with Applications in A/B Testing

A/B testing: A systematic literature review

An Analysis of Switchback Designs in Reinforcement Learning

Online Experimentation with Surrogate Metrics: Guidelines and a Case Study

Empirical Bayes Multistage Testing for Large-Scale Experiments

Bayesian Sequentially Monitored Multi-arm Experiments with Multiple Comparison Adjustments

Short-lived High-volume Multi-A(rmed)/B(andits) Testing

A Set of Estimation and Decision Preference Experiments for Exploring Risk Assessment Biases in Engineering Students

ForTune: Running Offline Scenarios to Estimate Impact on Business Metrics

The assessment of affective decision‐making: Exploring alternative scoring methods for the Balloon Analog Risk Task and Columbia Card Task

Risk Aware Benchmarking of Large Language Models

Dynamic Causal Effects Evaluation in A/B Testing with a Reinforcement Learning Framework

Non-marginal Decisions: A Novel Bayesian Multiple Testing Procedure

On (assessing) the fairness of risk score models

Examining properness in the external validation of survival models with squared and logarithmic losses