Abstract: In this work, we consider the problem of model selection for deep reinforcement learning (RL) in real-world environments. Typically, the performance of deep RL algorithms is evaluated via on-policy interactions with the target environment. However, comparing models in a real-world environment for the purposes of early stopping or hyperparameter tuning is costly and often practically infeasible. This leads us to examine off-policy policy evaluation (OPE) in such settings. We focus on OPE for value-based methods, which are of particular interest in deep RL applications such as robotics, where off-policy algorithms based on Q-function estimation can often attain better sample complexity than direct policy optimization. Existing OPE metrics rely either on a model of the environment or on importance sampling (IS) to correct for the data being off-policy. However, for high-dimensional observations such as images, models of the environment can be difficult to fit, and value-based methods can make IS hard to use or even ill-conditioned, especially when dealing with continuous action spaces. In this paper, we focus on the specific case of MDPs with continuous action spaces and sparse binary rewards, which is representative of many important real-world applications. We propose an alternative metric that relies on neither models nor IS, by framing OPE as a positive-unlabeled (PU) classification problem with the Q-function as the decision function. We experimentally show that this metric outperforms baselines on a number of tasks. Most importantly, it can reliably predict the relative performance of different policies in a number of generalization scenarios, including the transfer to the real world of policies trained in simulation for an image-based robotic manipulation task.
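The abstract names the idea (treating the Q-function as the decision function of a PU classifier) without giving the metric itself. As a rough illustration only, the sketch below scores a Q-function by how much higher its values are on state-action pairs from successful episodes (the positives) than on all logged pairs (the unlabeled set). The function name `pu_soft_score`, the particular scoring formula, the `positive_prior` parameter, and the toy numbers are all assumptions made for this example and are not taken from the abstract.

```python
from statistics import fmean
from typing import Sequence


def pu_soft_score(q_positive: Sequence[float],
                  q_unlabeled: Sequence[float],
                  positive_prior: float) -> float:
    """Hypothetical PU-style score for a Q-function (illustrative only).

    q_positive:     Q-values on (state, action) pairs from episodes that
                    ended in success (binary reward of 1).
    q_unlabeled:    Q-values on all logged (state, action) pairs, whose
                    labels are unknown.
    positive_prior: assumed fraction of positives among the unlabeled pairs.

    A higher score means the Q-function assigns relatively larger values
    to the known positives than to the unlabeled data as a whole.
    """
    if not q_positive or not q_unlabeled:
        raise ValueError("both Q-value collections must be non-empty")
    if not 0.0 <= positive_prior <= 1.0:
        raise ValueError("positive_prior must lie in [0, 1]")
    return positive_prior * fmean(q_positive) - fmean(q_unlabeled)


# Toy comparison of two hypothetical Q-functions on the same logged data.
# Q-function A separates positives from the rest; Q-function B does not.
score_a = pu_soft_score([0.9, 0.8], [0.9, 0.8, 0.2, 0.1], positive_prior=0.5)
score_b = pu_soft_score([0.5, 0.5], [0.5, 0.5, 0.5, 0.5], positive_prior=0.5)

# Rank the candidates by score, best first (model selection without rollouts).
ranking = sorted({"A": score_a, "B": score_b}.items(),
                 key=lambda item: item[1], reverse=True)
```

Under these assumptions, candidate policies can be ranked using logged data alone, with no environment model and no importance weights, which is the property the abstract emphasizes.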

Information-Directed Policy Search in Sparse-Reward Settings Via the Occupancy Information Ratio

Occupancy Information Ratio: Infinite-Horizon, Information-Directed, Parameterized Policy Search

Efficient Reinforcement Learning via Decoupling Exploration and Utilization

Information Directed Reward Learning for Reinforcement Learning

OMPO: A Unified Framework for RL under Policy and Dynamics Shifts

AIBPO: Combine the Intrinsic Reward and Auxiliary Task for 3D Strategy Game

Efficient Exploration in Resource-Restricted Reinforcement Learning

A Scalable Derivative-free Exploration Approach for Reinforcement Learning

Off-Policy Evaluation via Off-Policy Classification

STEERING: Stein Information Directed Exploration for Model-Based Reinforcement Learning

Off-Policy Deep Reinforcement Learning with Analogous Disentangled Exploration

How does Your RL Agent Explore? An Optimal Transport Analysis of Occupancy Measure Trajectories

OPAC: Opportunistic Actor-Critic

DEIR: Efficient and Robust Exploration through Discriminative-Model-Based Episodic Intrinsic Rewards

RVI-SAC: Average Reward Off-Policy Deep Reinforcement Learning

Inverse Reinforcement Learning with Explicit Policy Estimates

Information-Directed Exploration for Deep Reinforcement Learning

AlignIQL: Policy Alignment in Implicit Q-Learning through Constrained Optimization

MADE: Exploration via Maximizing Deviation from Explored Regions

Pareto Inverse Reinforcement Learning for Diverse Expert Policy Generation

R-AIF: Solving Sparse-Reward Robotic Tasks from Pixels with Active Inference and World Models