Abstract: In this work, we consider the problem of model selection for deep reinforcement learning (RL) in real-world environments. Typically, the performance of deep RL algorithms is evaluated via on-policy interactions with the target environment. However, comparing models in a real-world environment for the purposes of early stopping or hyperparameter tuning is costly and often practically infeasible. This leads us to examine off-policy policy evaluation (OPE) in such settings. We focus on OPE for value-based methods, which are of particular interest in deep RL for applications such as robotics, where off-policy algorithms based on Q-function estimation can often attain better sample complexity than direct policy optimization. Existing OPE metrics rely either on a model of the environment or on importance sampling (IS) to correct for the data being off-policy. However, for high-dimensional observations, such as images, models of the environment can be difficult to fit, and value-based methods can make IS hard to use or even ill-conditioned, especially when dealing with continuous action spaces. In this paper, we focus on the specific case of MDPs with continuous action spaces and sparse binary rewards, which is representative of many important real-world applications. We propose an alternative metric that relies on neither models nor IS, by framing OPE as a positive-unlabeled (PU) classification problem with the Q-function as the decision function. We experimentally show that this metric outperforms baselines on a number of tasks. Most importantly, it can reliably predict the relative performance of different policies in a number of generalization scenarios, including the transfer of policies trained in simulation to the real world for an image-based robotic manipulation task.
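To make the PU framing concrete, the following is a minimal sketch of how a Q-function could be scored as a classifier over logged (state, action) pairs. It is an illustration under stated assumptions and not necessarily the metric proposed in the paper: the function name `pu_q_score`, the labeling rule (pairs from trajectories that ended in success are treated as positive, all others as unlabeled), the `positive_prior` parameter, and the scoring formula itself are all hypothetical choices made for this example.

```python
from statistics import mean


def pu_q_score(q_values, is_positive, positive_prior):
    """Hypothetical PU-style score for a Q-function (illustrative assumption).

    q_values:       Q(s, a) for each logged (state, action) pair.
    is_positive:    True where the pair is labeled positive (assumed here to
                    mean it came from a trajectory that ended in success);
                    False means unlabeled, not negative.
    positive_prior: assumed fraction of pairs that are truly positive.

    The score rewards high Q on labeled positives and penalizes high Q
    on the whole dataset, so a higher score suggests a better ranking.
    """
    positives = [q for q, pos in zip(q_values, is_positive) if pos]
    if not positives:
        raise ValueError("need at least one positively labeled pair")
    return positive_prior * mean(positives) - mean(q_values)


# Toy data: two positive pairs followed by two unlabeled pairs.
labels = [True, True, False, False]
q_good = [0.9, 0.8, 0.2, 0.1]  # high Q on the positives
q_bad = [0.2, 0.1, 0.9, 0.8]   # high Q on the unlabeled pairs

score_good = pu_q_score(q_good, labels, positive_prior=0.5)
score_bad = pu_q_score(q_bad, labels, positive_prior=0.5)
print(score_good > score_bad)
```

Under these assumptions, a score of this kind needs only the logged pairs and the Q-function's outputs, which is why it requires neither an environment model nor importance weights. It is meant for ranking candidate Q-functions against each other, not for estimating their absolute returns.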

Off-Policy Evaluation and Learning for External Validity under a Covariate Shift

Triply Robust Off-Policy Evaluation

More Efficient Off-Policy Evaluation through Regularized Targeted Learning

Off-Policy Evaluation of Bandit Algorithm from Dependent Samples under Batch Update Policy

Distributionally Robust Policy Evaluation under General Covariate Shift in Contextual Bandits

A Practical Guide of Off-Policy Evaluation for Bandit Problems

Off-Policy Evaluation Using Information Borrowing and Context-Based Switching

Off-policy Evaluation in Infinite-Horizon Reinforcement Learning with Latent Confounders

Counterfactual Learning with General Data-generating Policies

Off-Policy Evaluation in Doubly Inhomogeneous Environments

Conformal Off-Policy Evaluation in Markov Decision Processes

Effective Off-Policy Evaluation and Learning in Contextual Combinatorial Bandits

Doubly-Robust Off-Policy Evaluation with Estimated Logging Policy

Quantile Off-Policy Evaluation via Deep Conditional Generative Learning

Off-Policy Evaluation for Large Action Spaces via Conjunct Effect Modeling

Off-Policy Evaluation via Off-Policy Classification

When is Off-Policy Evaluation (Reward Modeling) Useful in Contextual Bandits? A Data-Centric Perspective

Estimating Model Performance Under Covariate Shift Without Labels

Adapting to Continuous Covariate Shift via Online Density Ratio Estimation

Off-policy evaluation beyond overlap: partial identification through smoothness