A Practical Guide of Off-Policy Evaluation for Bandit Problems

Masahiro Kato, Kenshi Abe, Kaito Ariu, Shota Yasui
DOI: https://doi.org/10.48550/arXiv.2010.12470
2020-10-23
Abstract: Off-policy evaluation (OPE) is the problem of estimating the value of a target policy from samples obtained via different policies. Recently, applying OPE methods for bandit problems has garnered attention. For the theoretical guarantees of an estimator of the policy value, the OPE methods require various conditions on the target policy and policy used for generating the samples. However, existing studies did not carefully discuss the practical situation where such conditions hold, and the gap between them remains. This paper aims to show new results for bridging the gap. Based on the properties of the evaluation policy, we categorize OPE situations. Then, among practical applications, we mainly discuss the best policy selection. For the situation, we propose a meta-algorithm based on existing OPE estimators. We investigate the proposed concepts using synthetic and open real-world datasets in experiments.
Machine Learning, Econometrics
What problem does this paper attempt to address?
The core problem this paper addresses is **the gap between the practical application and the theoretical guarantees of off-policy evaluation (OPE) in bandit problems**. Specifically, the paper focuses on the following aspects:

1. **Applicable conditions of OPE methods**: Existing OPE methods usually rely on assumptions to ensure the unbiasedness, consistency, and asymptotic normality of their estimators. These conditions are often difficult to meet in practical applications, which leaves a gap between theory and practice. The paper examines how reasonable these conditions are in practice and proposes new results to bridge the gap.
2. **Terminology confusion**: The paper points out that terminology is used inconsistently in the OPE field, which may lead to the misuse of methods. For example, the evaluation probabilities used in some studies are not deterministic, which may affect the reproducibility of experimental results. The paper emphasizes the importance of distinguishing between behavior policies and evaluation policies and discusses the misunderstandings that improper use of terms can cause.
3. **Best policy selection**: The paper pays special attention to how to select the best evaluation policy in practical applications. To this end, the authors propose a meta-algorithm based on existing OPE estimators and verify it experimentally on synthetic and real-world datasets.

### Specific problem summary

- **Theoretical requirements of OPE methods**: Existing OPE methods usually require certain conditions (such as independence) to hold between the evaluation policy and the behavior policy used to generate samples, but these conditions often do not hold in practical applications.
- **Term consistency**: The inconsistent use of terms in the OPE field may lead to the misuse of methods.
- **Best policy selection in practical applications**: The paper explores how to select the best evaluation policy in practice and proposes a meta-algorithm based on existing OPE estimators.

### Main contributions of the paper

1. **Summarize potential problems and limitations**: The paper summarizes the potential problems and limitations of OPE methods in practical applications.
2. **Sort out OPE terms**: The paper organizes and clarifies the key terms in the OPE field.
3. **Classify OPE application situations**: The paper classifies the situations in which OPE methods are applied according to the characteristics of the evaluation probabilities.
4. **Prove new results**: The paper presents new theoretical results to bridge the gap between theory and practical applications.
5. **Experimental verification**: The paper verifies the proposed concepts and methods on synthetic and real-world datasets.

Through this work, the paper aims to provide a more solid theoretical basis and practical guidance for applying OPE to bandit problems.
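To make the setting concrete, the following is a minimal sketch of OPE with the standard inverse probability weighting (IPW) estimator, followed by a simple sample-splitting approach to best policy selection. It is an illustration under stated assumptions, not necessarily the paper's meta-algorithm: the function names `ipw_value` and `select_then_evaluate`, the context-free candidate policies, and the synthetic Bernoulli data are all hypothetical. The split is one common way to keep the selected evaluation policy independent of the samples used to estimate its value.

```python
import numpy as np


def ipw_value(actions, rewards, behavior_probs, eval_probs):
    """IPW estimate of an evaluation policy's value from logged bandit data.

    actions:        (n,) integer indices of the logged actions
    rewards:        (n,) observed rewards for the logged actions
    behavior_probs: (n, K) action probabilities of the behavior policy
    eval_probs:     (n, K) action probabilities of the evaluation policy
    """
    idx = np.arange(len(actions))
    weights = eval_probs[idx, actions] / behavior_probs[idx, actions]
    return float(np.mean(weights * rewards))


def select_then_evaluate(actions, rewards, behavior_probs, candidates):
    """Pick the best candidate on the first half of the data, then
    re-estimate its value on the held-out second half (assumed scheme)."""
    half = len(actions) // 2
    first, second = slice(0, half), slice(half, None)
    scores = [
        ipw_value(actions[first], rewards[first], behavior_probs[first],
                  np.tile(p, (half, 1)))
        for p in candidates
    ]
    best = int(np.argmax(scores))
    n_held_out = len(actions) - half
    held_out_value = ipw_value(
        actions[second], rewards[second], behavior_probs[second],
        np.tile(candidates[best], (n_held_out, 1)),
    )
    return best, held_out_value


# Synthetic logged data: 3 arms, uniform behavior policy, Bernoulli rewards.
rng = np.random.default_rng(0)
n = 20_000
true_means = np.array([0.2, 0.5, 0.8])  # assumed for illustration only
behavior_probs = np.full((n, 3), 1.0 / 3.0)
actions = rng.integers(0, 3, size=n)
rewards = rng.binomial(1, true_means[actions]).astype(float)

# Two fixed (data-independent) candidate evaluation policies.
candidates = [np.array([0.8, 0.1, 0.1]), np.array([0.1, 0.1, 0.8])]
best, value = select_then_evaluate(actions, rewards, behavior_probs, candidates)
print(f"selected candidate {best}, held-out IPW value {value:.3f}")
```

In this synthetic setup the true value of the second candidate is 0.1 × 0.2 + 0.1 × 0.5 + 0.8 × 0.8 = 0.71, so the held-out estimate should land near that figure. Estimating the selected policy's value on the same samples used for selection would make the evaluation policy depend on the data, which is the kind of situation the summary above flags as breaking the usual guarantees.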