Abstract: In large-scale online experimentation platforms, experimenters aim to discover the best treatment (arm) among multiple candidates. Traditional A/B testing and multi-armed bandit (MAB) algorithms are two popular designs. The former usually achieves higher power but may hurt customer satisfaction by continuing to recommend a poor arm, while the latter aims to improve the customer experience (collecting more rewards) at the cost of testing power. Recently, [26] combined the advantages of A/B testing and MAB algorithms to maximize testing power while maintaining more rewards in two-armed experiments with Bernoulli rewards. In practice, however, the number of arms is usually larger than two and the reward type varies. In multi-arm experiments, the sample size required to find the optimal arm while guaranteeing a false discovery rate blows up as the number of arms increases, bringing high opportunity costs to experimenters. To reduce the cost of a long experimental process, we propose a more efficient sequential test framework named Soptima that works with general reward types. Inspired by the design of traditional MAB algorithms in chasing rewards and of A/B testing in maximizing power, we propose an elimination-type strategy adapted to this framework that dynamically adjusts the traffic split across arms. Combined with Soptima, this strategy simultaneously retains the advantage of A/B testing in maximizing testing power, of sequential test methods in reducing the sample size, and of MAB algorithms in collecting rewards. Our theoretical analysis gives guarantees on the Type-I, Type-II, and optimality error rates of the proposed approach. A series of experiments on both simulated and historical industrial data sets is conducted to verify the superiority of our approach over available baselines.

Anytime-Valid Confidence Sequences in an Enterprise A/B Testing Platform

Rapid and Scalable Bayesian AB Testing

YEAST: Yet Another Sequential Test

Powerful A/B-Testing Metrics and Where to Find Them

Empirical Bayes Multistage Testing for Large-Scale Experiments

Validation of massively-parallel adaptive testing using dynamic control matching

Large-Scale Online Experimentation with Quantile Metrics

Equivalence Test in Multi-dimensional Space with Applications in A/B Testing

Adaptive A/B Tests and Simultaneous Treatment Parameter Optimization

A framework for Multi-A(rmed)/B(andit) testing with online FDR control

Ranking by Lifts: A Cost-Benefit Approach to Large-Scale A/B Tests

Conducting A/B Experiments with a Scalable Architecture

Online Learning for Non-Stationary A/B Tests

An Online Sequential Test for Qualitative Treatment Effects

Best of Three Worlds: Adaptive Experimentation for Digital Marketing in Practice

Comparison Lift: Bandit-based Experimentation System for Online Advertising

Automated metrics calculation in a dynamic heterogeneous environment

All about sample-size calculations for A/B testing: Novel extensions and practical guide

Deep anytime-valid hypothesis testing

Bootstrap Matching: a robust and efficient correction for non-random A/B test, and its applications

Sequential Optimum Test with Multi-armed Bandits for Online Experimentation