Abstract: Directed exploration is a crucial challenge in reinforcement learning (RL), especially when rewards are sparse. Information-directed sampling (IDS), which optimizes the information ratio, seeks to address this by augmenting regret with information gain. However, estimating information gain is either computationally intractable or relies on restrictive assumptions, which prohibits its use in many practical instances. In this work, we posit an alternative exploration incentive in terms of the integral probability metric (IPM) between a current estimate of the transition model and the unknown optimal, which, under suitable conditions, can be computed in closed form with the kernelized Stein discrepancy (KSD). Based on KSD, we develop a novel algorithm, STEERING: **STE**in information dir**E**cted exploration for model-based **R**einforcement Learn**ING**. To enable its derivation, we develop fundamentally new variants of KSD for discrete conditional distributions. We further establish that STEERING achieves sublinear Bayesian regret, improving upon prior learning rates of information-augmented MBRL. Experimentally, we show that the proposed algorithm is computationally affordable and outperforms several prior approaches.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is how to explore effectively in reinforcement learning, especially in the case of sparse rewards. Specifically, the paper focuses on how to design an effective exploration strategy in model-based reinforcement learning (MBRL) to overcome the problems that existing methods are computationally infeasible or that their exploration direction is not clear. Although the traditional Information-Directed Sampling (IDS) method performs well in theory, it faces two main challenges in practical applications:

1. **Computational infeasibility**: Estimating the information gain is very computationally complex, especially in high-dimensional or large-scale environments.
2. **Non-directed exploration**: Existing IDS methods focus more on collecting all information about the environment rather than exploring directly towards the optimal transition dynamics.

To solve these problems, the paper proposes a new exploration incentive, Stein information gain, and realizes it through the Kernelized Stein Discrepancy (KSD). Stein information gain guides exploration by calculating the Integral Probability Metric (IPM) between the currently estimated transition model and the true but unknown transition model, thereby achieving directed exploration. This method is not only more computationally feasible but can also approach the optimal transition dynamics more effectively.

### Main contributions of the paper

1. **Formalizes the Bayesian regret in model-based phased RL and introduces KSD to measure the distance from the real MDP, thus achieving directed exploration**.
2. **Introduces the Discrete Conditional KSD (DSD) in tabular RL for the first time to analyze the distribution distance**.
3. **Proposes the STEERING algorithm, which optimizes the policy by minimizing the Stein information ratio, thereby achieving efficient directed exploration**.
4. **Establishes the prior-free sub-linear Bayesian regret bound of STEERING and provides conditions for further improving the regret bound under certain regularity conditions**.
5. **Verifies the performance of the STEERING algorithm in the sparse-reward setting through extensive experiments, showing that it is superior to existing methods in efficient directed exploration**.

### Specific technical details

- **Kernelized Stein Discrepancy (KSD)**: KSD is a method for measuring the similarity between two distributions, especially suitable for evaluating the quality of the posterior distribution when the target distribution is unknown. The paper calculates KSD by defining the Stein operator and Stein kernel, thereby achieving computational feasibility.
- **Discrete Conditional KSD (DSD)**: To adapt to the discrete state and action spaces in tabular RL, the paper introduces the discrete conditional KSD, a new metric that can effectively evaluate the distance between conditional distributions.
- **STEERING algorithm**: This algorithm optimizes the policy by minimizing the Stein information ratio, thereby exploring towards the optimal transition dynamics at each step. The algorithm also includes an intelligent sample selection process to reduce redundant samples and increase the convergence speed.

Through these innovations, the paper provides a new method for achieving efficient directed exploration in model-based reinforcement learning, one that performs especially well in environments with sparse rewards.
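To make the "closed form" claim about KSD more concrete, the following is a minimal sketch of the standard continuous KSD estimator, not the paper's discrete conditional variant (DSD). It assumes a one-dimensional target whose score function (the derivative of the log-density) is known, an RBF kernel, and a V-statistic average over sample pairs. The function names, the bandwidth default, and the standard-normal target are illustrative assumptions and do not come from the paper.

```python
import math
import random


def rbf_stein_kernel(x, y, score, bandwidth=1.0):
    """Stein kernel u_p(x, y) for a 1-D target with score function `score`.

    Uses the RBF base kernel k(x, y) = exp(-(x - y)^2 / (2 h^2)).
    Only the score of the target is needed, never its normalizing constant.
    """
    h2 = bandwidth ** 2
    d = x - y
    k = math.exp(-d * d / (2.0 * h2))
    dk_dx = -d / h2 * k                           # derivative of k w.r.t. x
    dk_dy = d / h2 * k                            # derivative of k w.r.t. y
    d2k_dxdy = (1.0 / h2 - d * d / (h2 * h2)) * k  # mixed second derivative
    sx, sy = score(x), score(y)
    return sx * sy * k + sx * dk_dy + sy * dk_dx + d2k_dxdy


def ksd_squared(samples, score, bandwidth=1.0):
    """V-statistic estimate of the squared KSD between samples and the target."""
    n = len(samples)
    total = 0.0
    for x in samples:
        for y in samples:
            total += rbf_stein_kernel(x, y, score, bandwidth)
    return total / (n * n)


def standard_normal_score(x):
    """Score of N(0, 1): d/dx log p(x) = -x (illustrative target only)."""
    return -x


if __name__ == "__main__":
    rng = random.Random(0)
    matched = [rng.gauss(0.0, 1.0) for _ in range(200)]
    shifted = [x + 2.0 for x in matched]  # samples from a mismatched model
    print("matched:", ksd_squared(matched, standard_normal_score))
    print("shifted:", ksd_squared(shifted, standard_normal_score))
```

In this sketch the samples drawn from the target give a discrepancy near zero, while the shifted samples give a clearly larger one. This is the kind of signal a discrepancy-based exploration incentive could use to tell how far a model estimate is from the target, without ever evaluating the target's normalizing constant.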

STEERING: Stein Information Directed Exploration for Model-Based Reinforcement Learning
