Optimizing Adaptive Experiments: A Unified Approach to Regret Minimization and Best-Arm Identification

Chao Qin, Daniel Russo
2024-07-30
Abstract: Practitioners conducting adaptive experiments often encounter two competing priorities: maximizing total welfare (or 'reward') through effective treatment assignment and swiftly concluding experiments to implement population-wide treatments. Current literature addresses these priorities separately, with regret minimization studies focusing on the former and best-arm identification research on the latter. This paper bridges this divide by proposing a unified model that simultaneously accounts for within-experiment performance and post-experiment outcomes. We provide a sharp theory of optimal performance in large populations that not only unifies canonical results in the literature but also uncovers novel insights. Our theory reveals that familiar algorithms, such as the recently proposed top-two Thompson sampling algorithm, can optimize a broad class of objectives if a single scalar parameter is appropriately adjusted. In addition, we demonstrate that substantial reductions in experiment duration can often be achieved with minimal impact on both within-experiment and post-experiment regret.
Machine Learning, Econometrics
What problem does this paper attempt to address?
This paper addresses the tension between two main priorities in multi-armed bandit (MAB) experiments: maximizing total welfare (or "reward") through efficient treatment allocation, and ending the experiment quickly so that a treatment can be implemented for the entire population. Specifically:

1. **Maximizing Total Welfare (Regret Minimization)**: The goal is to maximize the total welfare of all treated individuals, that is, to minimize the total regret caused by suboptimal treatment allocation. This usually involves adaptive, sequential treatment allocation during the experiment.
2. **Quickly Ending the Experiment and Deploying the Optimal Treatment (Best-Arm Identification)**: The goal is to identify the optimal treatment as soon as possible and apply it to the entire population. This usually involves minimizing the expected length of the experiment while ensuring that the correct treatment is selected with high probability.

Current literature usually treats these two goals separately, studying regret minimization and best-arm identification in isolation. In practice, however, researchers often need to consider both at once. The paper therefore proposes a unified framework that optimizes both performance within the experiment and outcomes after the experiment.

### Main Contributions of the Paper

1. **Unified Classical Theory**:
   - When the cost function is almost entirely determined by the regret of treatment allocation, the objective of the model is consistent with the work of Lai and Robbins [1985] and recovers their well-known optimal regret formula.
   - When every arm has the same within-experiment cost, the results of the model are consistent with the best-arm identification theory of Garivier and Kaufmann [2016].
2. **Pareto Frontier between Experiment Length and Total Regret**:
   - The research reveals that, in some cases, a significant reduction in experiment length has only a minor impact on cumulative regret.
   - For more complex trade-off situations, the paper provides an exact description of the Pareto frontier and its interpretable boundaries (Proposition 2).
3. **Adjustment of Popular Algorithms**:
   - By adjusting a single tuning parameter of the Top-Two Thompson Sampling algorithm, adaptive experiments can be optimized in very large populations.
   - This adjustment applies not only to the trade-off between a specific experiment length and total regret, but also to the generalized objective function.
4. **Nature of Asymptotically Efficient Strategies**:
   - The research shows that the properties of asymptotically efficient strategies are almost independent of the per-period cost function.
   - The optimal allocation of exploration effort is completely determined by an information balance property: the strength of statistical evidence against each suboptimal alternative should grow at an equal rate.

### Numerical Experiment Results

The numerical experiments show the advantages of the Top-Two Thompson Sampling algorithm over the Epsilon-Greedy algorithm:

- The experiment time is significantly shortened, and the cumulative regret is greatly reduced.
- It is more effective at identifying and prioritizing potentially optimal arms, while measuring obviously suboptimal arms less often.

In conclusion, this paper proposes a new method that can simultaneously optimize within-experiment performance and post-experiment decision quality in large populations, providing theoretical support and practical guidance for adaptive experiments in practice.
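
To make the single tuning parameter of top-two Thompson sampling concrete, here is a minimal Python sketch of the sampling rule as it is commonly presented. It is an illustration under stated assumptions, not the paper's implementation: rewards are assumed Gaussian with known noise level, the prior is flat, the horizon is fixed (the paper's stopping rule and cost function are not reproduced), and the names `beta`, `noise_sd`, and `max_resample` are chosen here for illustration.

```python
import math
import random


def top_two_thompson_sampling(true_means, beta=0.5, horizon=2000,
                              noise_sd=1.0, max_resample=100, seed=0):
    """Simulate top-two Thompson sampling on Gaussian arms (illustrative only).

    beta is the probability of playing the sampled leader; with probability
    1 - beta a challenger (a sampled best arm different from the leader) is
    played instead. All parameter names are assumptions of this sketch.
    """
    k = len(true_means)
    if k < 2 or horizon < k:
        raise ValueError("need at least two arms and horizon >= number of arms")

    rng = random.Random(seed)
    counts = [0] * k      # number of pulls per arm
    sums = [0.0] * k      # sum of observed rewards per arm

    def pull(arm):
        sums[arm] += rng.gauss(true_means[arm], noise_sd)
        counts[arm] += 1

    def posterior_draw():
        # Flat prior + known variance: posterior of arm a is N(mean_a, sd^2 / n_a).
        return [rng.gauss(sums[a] / counts[a], noise_sd / math.sqrt(counts[a]))
                for a in range(k)]

    def argmax(values, exclude=None):
        return max((a for a in range(k) if a != exclude), key=lambda a: values[a])

    # Pull every arm once so that each posterior is well defined.
    for arm in range(k):
        pull(arm)

    for _step in range(horizon - k):
        leader = argmax(posterior_draw())
        arm = leader
        if rng.random() >= beta:
            # Resample until the sampled best arm differs from the leader.
            for _attempt in range(max_resample):
                challenger = argmax(posterior_draw())
                if challenger != leader:
                    arm = challenger
                    break
            else:
                # Fallback after many identical draws: best non-leader arm.
                arm = argmax(posterior_draw(), exclude=leader)
        pull(arm)

    estimates = [sums[a] / counts[a] for a in range(k)]
    return counts, estimates


if __name__ == "__main__":
    pulls, means = top_two_thompson_sampling([0.0, 0.5, 1.0], beta=0.5)
    print("pulls per arm:", pulls)
    print("estimated means:", [round(m, 3) for m in means])
```

In this sketch, `beta = 1.0` never plays a challenger and so reduces to standard Thompson sampling, while smaller values shift measurement effort toward challengers. How this parameter should be set for a given objective is the subject of the paper and is not derived here.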