Abstract: In many domains, the exploration process of reinforcement learning will be too costly as it requires trying out suboptimal policies, resulting in a need for off-policy evaluation, in which a target policy is evaluated based on data collected from a known behaviour policy. In this context, importance sampling estimators provide estimates for the expected return by weighting the trajectory based on the probability ratio of the target policy and the behaviour policy. Unfortunately, such estimators have a high variance and therefore a large mean squared error. This paper proposes state-based importance sampling estimators which reduce the variance by dropping certain states from the computation of the importance weight. To illustrate their applicability, we demonstrate state-based variants of ordinary importance sampling, weighted importance sampling, per-decision importance sampling, incremental importance sampling, doubly robust off-policy evaluation, and stationary density ratio estimation. Experiments in four domains show that state-based methods consistently yield reduced variance and improved accuracy compared to their traditional counterparts.

What problem does this paper attempt to address?

The paper addresses how to conduct off-policy evaluation effectively in reinforcement learning when exploration is too costly (e.g., in medical interventions, financial decision-making, or safe navigation). Specifically, it focuses on reducing the variance of importance sampling (IS) estimators so that they perform well in long decision processes. Traditional importance sampling methods suffer from the "curse of horizon": the variance grows exponentially with the time horizon of the decision process, leading to poor performance. To address this issue, the paper proposes state-based importance sampling (SIS), which reduces variance by excluding certain "negligible states" from the importance weight calculation.

The main contributions of the paper include:

1. **Proposing a class of state-based importance sampling estimators** that reduce variance by excluding certain states from the importance weight calculation.
2. **Providing two methods for identifying negligible states**, one based on covariance testing and the other based on state-action values (Q-values).
3. **Implementing state-based variants of several traditional estimators**, including ordinary importance sampling, weighted importance sampling, per-decision importance sampling, incremental importance sampling, doubly robust off-policy evaluation, and stationary density ratio estimation.
4. **Validating the performance of these state-based estimators through experiments in four domains**, showing that they outperform their traditional counterparts in terms of variance reduction and accuracy.

Overall, the paper tackles the high-variance problem in off-policy evaluation by introducing state-based importance sampling, thereby improving the accuracy and reliability of evaluation in domains with long horizons.
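
The idea of dropping negligible states from the importance weight can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the two-state toy problem, the policies `pi_e` and `pi_b`, and the hand-picked set `dropped_states` are all hypothetical, and the negligible state is simply assumed to be known rather than identified by covariance testing or Q-values. With an empty `dropped_states`, the function reduces to ordinary importance sampling.

```python
from math import prod
from statistics import mean, pvariance

def weighted_returns(trajectories, pi_e, pi_b, dropped_states=frozenset(), gamma=1.0):
    """Per-trajectory importance-weighted returns.

    Each trajectory is a list of (state, action, reward) tuples.
    Steps whose state is in dropped_states contribute no ratio to the
    weight (the state-based variant); an empty set gives ordinary IS.
    """
    terms = []
    for traj in trajectories:
        weight = prod(
            pi_e[s][a] / pi_b[s][a]
            for s, a, _ in traj
            if s not in dropped_states
        )
        ret = sum(gamma ** t * r for t, (_, _, r) in enumerate(traj))
        terms.append(weight * ret)
    return terms

# Hypothetical toy problem: the action in "hall" never affects the reward,
# while action 1 in "door" yields reward 1 and action 0 yields reward 0.
pi_b = {"hall": {0: 0.5, 1: 0.5}, "door": {0: 0.5, 1: 0.5}}  # behaviour policy
pi_e = {"hall": {0: 0.9, 1: 0.1}, "door": {0: 0.2, 1: 0.8}}  # target policy

# Enumerate every action pair once; under the uniform behaviour policy each
# pair is equally likely, so the sample mean equals the exact expectation.
trajectories = [
    [("hall", a_hall, 0.0), ("door", a_door, float(a_door))]
    for a_hall in (0, 1)
    for a_door in (0, 1)
]

ois_terms = weighted_returns(trajectories, pi_e, pi_b)
sis_terms = weighted_returns(trajectories, pi_e, pi_b, dropped_states={"hall"})

ois_estimate = mean(ois_terms)
sis_estimate = mean(sis_terms)

print("ordinary IS:   ", ois_estimate, "variance", pvariance(ois_terms))
print("state-based IS:", sis_estimate, "variance", pvariance(sis_terms))
```

In this toy setting both estimates agree with the target policy's true expected return of 0.8 (up to floating-point rounding), but the weighted returns are less spread out once the ratio for "hall" is dropped. Dropping a state whose action does influence the return would generally bias the estimate, which is why the choice of negligible states matters.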

Low Variance Off-policy Evaluation with State-based Importance Sampling

Importance Sampling Policy Evaluation with an Estimated Behavior Policy

Variance Analysis of Multiple Importance Sampling Schemes

Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling

Efficient Policy Evaluation with Safety Constraint for Reinforcement Learning

Robust On-Policy Sampling for Data-Efficient Policy Evaluation in Reinforcement Learning

On the Reuse Bias in Off-Policy Reinforcement Learning

Policy Optimization Through Approximate Importance Sampling

A Deep Reinforcement Learning Approach to Rare Event Estimation

Nonasymptotic Bounds for Suboptimal Importance Sampling

Importance Sampled Stochastic Optimization for Variational Inference

Improving Importance Sampling Method in Structural Reliability

Variance-Reduced Off-Policy Memory-Efficient Policy Search

Optimal Mixture Weights for Off-Policy Evaluation with Multiple Behavior Policies

Efficient Multi-Policy Evaluation for Reinforcement Learning

Importance sampling for online variational learning

Policy Optimization via Importance Sampling

Independence-aware Advantage Estimation

Beyond ELBOs: A Large-Scale Evaluation of Variational Methods for Sampling

Mean-Semivariance Policy Optimization via Risk-Averse Reinforcement Learning

Variance Reduction based Partial Trajectory Reuse to Accelerate Policy Gradient Optimization