Low Variance Off-policy Evaluation with State-based Importance Sampling

David M. Bossens, Philip S. Thomas
2024-05-04
Abstract: In many domains, the exploration process of reinforcement learning will be too costly as it requires trying out suboptimal policies, resulting in a need for off-policy evaluation, in which a target policy is evaluated based on data collected from a known behaviour policy. In this context, importance sampling estimators provide estimates for the expected return by weighting the trajectory based on the probability ratio of the target policy and the behaviour policy. Unfortunately, such estimators have a high variance and therefore a large mean squared error. This paper proposes state-based importance sampling estimators which reduce the variance by dropping certain states from the computation of the importance weight. To illustrate their applicability, we demonstrate state-based variants of ordinary importance sampling, weighted importance sampling, per-decision importance sampling, incremental importance sampling, doubly robust off-policy evaluation, and stationary density ratio estimation. Experiments in four domains show that state-based methods consistently yield reduced variance and improved accuracy compared to their traditional counterparts.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses off-policy evaluation in reinforcement learning for settings where exploration is too costly, such as medical interventions, financial decision-making, or safe navigation. Specifically, it focuses on reducing the variance of importance sampling (IS) estimators so that they perform better in long decision processes. Traditional importance sampling methods suffer from the "curse of horizon": the variance grows exponentially with the time horizon of the decision process, leading to poor performance. To address this, the paper proposes state-based importance sampling (SIS), which reduces variance by excluding certain "negligible states" from the importance weight computation.

The main contributions of the paper are:

1. **Proposing a class of state-based importance sampling estimators** that reduce variance by dropping certain states from the computation of the importance weight.
2. **Providing two methods for identifying negligible states**, one based on covariance testing and the other based on state-action values (Q-values).
3. **Implementing state-based variants of several traditional estimators**, including ordinary importance sampling, weighted importance sampling, per-decision importance sampling, incremental importance sampling, doubly robust off-policy evaluation, and stationary density ratio estimation.
4. **Validating the state-based estimators through experiments in four domains**, which show reduced variance and improved accuracy compared to their traditional counterparts.

Overall, the paper tackles the high variance of importance sampling in off-policy evaluation by introducing state-based estimators, with the aim of making evaluations more accurate and reliable in domains that involve long decision horizons.
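To make the idea of dropping states from the importance weight concrete, here is a minimal sketch of a state-based variant of ordinary importance sampling. It is an illustration under stated assumptions, not the paper's implementation: the function name `sis_estimate`, the toy states `"x"` and `"y"`, the policy probabilities, and the hand-picked set of negligible states are all hypothetical. The two methods for identifying negligible states (covariance testing and Q-values) are not implemented here; the set is simply passed in.

```python
from typing import Callable, FrozenSet, Hashable, Sequence, Tuple

State = Hashable
Action = Hashable
Step = Tuple[State, Action, float]          # (state, action, reward)
Policy = Callable[[State, Action], float]   # returns P(action | state)


def sis_estimate(
    trajectories: Sequence[Sequence[Step]],
    pi_e: Policy,
    pi_b: Policy,
    negligible_states: FrozenSet[State] = frozenset(),
    gamma: float = 1.0,
) -> float:
    """Average of (importance weight * discounted return) over trajectories.

    Steps whose state is in `negligible_states` are left out of the weight.
    With an empty set this reduces to ordinary importance sampling.
    """
    if not trajectories:
        raise ValueError("need at least one trajectory")
    total = 0.0
    for trajectory in trajectories:
        weight, ret = 1.0, 0.0
        for t, (state, action, reward) in enumerate(trajectory):
            ret += (gamma ** t) * reward
            if state not in negligible_states:
                # Probability ratio of target policy to behaviour policy.
                weight *= pi_e(state, action) / pi_b(state, action)
        total += weight * ret
    return total / len(trajectories)


# Hypothetical toy problem: the action taken in state "y" never changes the reward.
PI_E = {"x": {0: 0.2, 1: 0.8}, "y": {0: 0.1, 1: 0.9}}


def pi_e(state: State, action: Action) -> float:
    return PI_E[state][action]


def pi_b(state: State, action: Action) -> float:
    return 0.5  # uniform behaviour policy over two actions


trajectories = [
    [("x", 1, 1.0), ("y", 0, 0.0)],
    [("x", 0, 0.0), ("y", 1, 0.0)],
]

ois = sis_estimate(trajectories, pi_e, pi_b)
sis = sis_estimate(trajectories, pi_e, pi_b, negligible_states=frozenset({"y"}))
print(f"ordinary IS: {ois:.3f}, state-based IS: {sis:.3f}")
```

In this toy data the two estimates differ only because the probability ratio at state `"y"` is dropped from the weight. That ratio varies between trajectories even though, by construction of the example, the action at `"y"` has no effect on the return, which is the kind of situation the "negligible states" idea targets.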