Abstract: To accumulate knowledge and improve its policy of behaviour, a reinforcement learning agent can learn 'off-policy' about policies that differ from the policy used to generate its experience. This is important for learning counterfactuals, or when the experience was generated outside the agent's control. However, off-policy learning is non-trivial, and standard reinforcement-learning algorithms can be unstable and divergent. In this paper we discuss a novel family of off-policy prediction algorithms which are convergent by construction. The idea is to first learn on-policy about the data-generating behaviour, and then bootstrap an off-policy value estimate on this on-policy estimate, thereby constructing a value estimate that is partially off-policy. This process can be repeated to build a chain of value functions, each time bootstrapping a new estimate on the previous estimate in the chain. Each step in the chain is stable and hence the complete algorithm is guaranteed to be stable. Under mild conditions, the final estimate in the chain comes arbitrarily close to the off-policy TD solution as the length of the chain increases. Hence it can compute the solution even in cases where off-policy TD diverges. We prove that the proposed scheme is convergent and corresponds to an iterative decomposition of the inverse key matrix. Furthermore, it can be interpreted as estimating a novel objective, which we call a 'k-step expedition': following the target policy for finitely many steps before continuing indefinitely with the behaviour policy. Empirically, we evaluate the idea on challenging MDPs such as Baird's counterexample and observe favourable results.
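The chaining idea in the abstract can be illustrated with a minimal tabular sketch. This is not the paper's algorithm, which the abstract describes only at a high level: it uses exact expected backups on a made-up two-state MDP rather than sampled TD updates with function approximation. The transition matrices `P_MU` and `P_PI`, the rewards `R_MU` and `R_PI`, the discount `GAMMA`, and the helper names are all assumptions introduced for illustration. The first link of the chain is the on-policy value of the behaviour policy, and each later link backs up the previous one for a single step under the target policy, so link k is the value of a k-step expedition.

```python
# Tabular sketch of chained value functions on a hypothetical 2-state MDP.
# All numbers below are illustrative assumptions, not taken from the paper.
GAMMA = 0.9

# State-to-state transition probabilities and expected rewards under the
# behaviour policy (mu) and the target policy (pi).
P_MU = [[0.5, 0.5], [0.5, 0.5]]
R_MU = [0.0, 1.0]
P_PI = [[0.1, 0.9], [0.1, 0.9]]
R_PI = [0.0, 2.0]


def backup(P, r, v, gamma=GAMMA):
    """One expected Bellman backup: r + gamma * P v."""
    n = len(v)
    return [r[s] + gamma * sum(P[s][t] * v[t] for t in range(n)) for s in range(n)]


def evaluate(P, r, sweeps=2000):
    """On-policy evaluation by repeated backups (a contraction for gamma < 1)."""
    v = [0.0] * len(r)
    for _ in range(sweeps):
        v = backup(P, r, v)
    return v


def chain(k):
    """Return [v^0, ..., v^k].

    v^0 is the on-policy value of the behaviour policy. Each later link
    bootstraps on the previous one with a single step of the target policy,
    so v^j is the value of following pi for j steps and then mu forever.
    """
    values = [evaluate(P_MU, R_MU)]
    for _ in range(k):
        values.append(backup(P_PI, R_PI, values[-1]))
    return values


if __name__ == "__main__":
    v_pi = evaluate(P_PI, R_PI)  # reference: full target-policy value
    for k, v in enumerate(chain(50)):
        if k % 10 == 0:
            gap = max(abs(a - b) for a, b in zip(v, v_pi))
            print(f"k={k:2d}  max gap to v_pi = {gap:.6f}")
```

In this exact tabular setting the gap between link k and the target-policy value shrinks by at least a factor of `GAMMA` per link, which mirrors the abstract's claim that longer chains approach the off-policy solution. The conditions under which this carries over to function approximation are the subject of the paper itself.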

Efficient Reinforcement Learning in Deterministic Systems with Value Function Generalization

Provably Efficient Reinforcement Learning Via Surprise Bound

Optimistic Value Instructors for Cooperative Multi-Agent Reinforcement Learning

Randomized Exploration for Reinforcement Learning with General Value Function Approximation

Nearly Minimax Optimal Reward-free Reinforcement Learning

Generalized linear function approximation

Optimism in Reinforcement Learning with Generalized Linear Function Approximation

Provably Efficient Reinforcement Learning with General Value Function Approximation

Adaptive Exploration for Data-Efficient General Value Function Evaluations

Guarantees for Epsilon-Greedy Reinforcement Learning with Function Approximation

Value Function Optimistic Initialization with Uncertainty and Confidence Awareness in Lifelong Reinforcement Learning

A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation

Near-optimal Conservative Exploration in Reinforcement Learning under Episode-wise Constraints

Efficient Exploration in Continuous-time Model-based Reinforcement Learning

Reinforcement Learning from Partial Observation: Linear Function Approximation with Provable Sample Efficiency

Off-policy Evaluation in Infinite-Horizon Reinforcement Learning with Latent Confounders

Chaining Value Functions for Off-Policy Learning

Generalizing Across Multi-Objective Reward Functions in Deep Reinforcement Learning

Conservative Exploration for Policy Optimization via Off-Policy Policy Evaluation

Provably Efficient Exploration in Policy Optimization

Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation