Abstract: Designing efficient exploration is central to Reinforcement Learning due to the fundamental problem posed by the exploration-exploitation dilemma. Bayesian exploration strategies like Thompson Sampling resolve this trade-off in a principled way by modeling and updating the distribution of the parameters of the action-value function, the outcome model of the environment. However, this technique becomes infeasible for complex environments because maintaining probability distributions over the parameters of correspondingly complex outcome models is computationally intractable. Moreover, the approximation techniques introduced to mitigate this issue typically result in poor exploration-exploitation trade-offs, as observed for deep neural network models with approximate posterior methods, which have been shown to underperform in the deep bandit scenario. In this paper we introduce Sample Average Uncertainty (SAU), a simple and efficient uncertainty measure for contextual bandits. While Bayesian approaches like Thompson Sampling estimate outcome uncertainty indirectly by first quantifying the variability over the parameters of the outcome model, SAU is a frequentist approach that directly estimates the uncertainty of the outcomes based on the value predictions. Importantly, we show theoretically that the uncertainty measure estimated by SAU asymptotically matches the uncertainty provided by Thompson Sampling, as well as its regret bounds. Because of its simplicity, SAU can be seamlessly applied to deep contextual bandits as a very scalable drop-in replacement for epsilon-greedy exploration. We confirm our theory empirically by showing that SAU-based exploration outperforms current state-of-the-art deep Bayesian bandit methods on several real-world datasets at modest computation cost. Code is available at \url{https://github.com/ibm/sau-explore}.
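
To make concrete the idea of estimating outcome uncertainty directly from value predictions, the following is a minimal sketch and not the authors' implementation. It assumes a non-contextual multi-armed bandit, uses the running sample mean as the value prediction, and takes the uncertainty of each arm to be the running average of its squared prediction errors; the abstract does not state the exact estimator, so this choice, the Gaussian sampling rule, and all names (`SampleAverageUncertaintyBandit`, `run`) are illustrative assumptions.

```python
import math
import random


class SampleAverageUncertaintyBandit:
    """Hypothetical illustration: per-arm uncertainty from squared prediction errors."""

    def __init__(self, n_arms, seed=0):
        self.rng = random.Random(seed)
        self.counts = [0] * n_arms    # number of pulls per arm
        self.means = [0.0] * n_arms   # value prediction per arm (sample mean)
        self.sq_err = [0.0] * n_arms  # running average of squared prediction errors

    def select(self):
        # Pull every arm once before relying on the estimates.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        # Assumed exploration rule: perturb each prediction with Gaussian noise
        # whose variance is the averaged squared error divided by the pull count.
        scores = []
        for arm in range(len(self.counts)):
            std = math.sqrt(self.sq_err[arm] / self.counts[arm])
            scores.append(self.rng.gauss(self.means[arm], std))
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        # Prediction error is measured against the estimate before the update.
        error = reward - self.means[arm]
        self.counts[arm] += 1
        n = self.counts[arm]
        self.sq_err[arm] += (error ** 2 - self.sq_err[arm]) / n
        self.means[arm] += error / n


def run(true_means, steps=2000, seed=0):
    """Simulate Gaussian rewards with unit variance (an assumed toy environment)."""
    env_rng = random.Random(seed + 1)
    bandit = SampleAverageUncertaintyBandit(len(true_means), seed=seed)
    for _ in range(steps):
        arm = bandit.select()
        reward = env_rng.gauss(true_means[arm], 1.0)
        bandit.update(arm, reward)
    return bandit
```

Calling `run([0.1, 0.5, 0.9])` returns the bandit after 2000 simulated pulls, and its `counts`, `means`, and `sq_err` lists can then be inspected. The per-step cost is a constant-time update per arm with no posterior over model parameters, which is the property the abstract highlights; in a deep contextual bandit the sample mean would be replaced by a network's value prediction.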

Gain-based Exploration: from Multi-armed Bandits to Partially Observable Environments

Principal-Agent Bandit Games with Self-Interested and Exploratory Learning Agents

Learning to Explore with Lagrangians for Bandits under Unknown Linear Constraints

Dynamic Subgoal-based Exploration via Bayesian Optimization

Careful at Estimation and Bold at Exploration

Option-based Multi-agent Exploration

Regret Bounds and Reinforcement Learning Exploration of EXP-based Algorithms

Deterministic Exploration via Stationary Bellman Error Maximization

Disentangling Exploration from Exploitation

Incentivized Exploration of Non-Stationary Stochastic Bandits

Reward Maximization for Pure Exploration: Minimax Optimal Good Arm Identification for Nonparametric Multi-Armed Bandits

Bayesian Incentive-Compatible Bandit Exploration

Beyond Optimism: Exploration With Partially Observable Rewards

MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization

Behind the Myth of Exploration in Policy Gradients

Multi-task Representation Learning for Pure Exploration in Bilinear Bandits

Influence-Based Multi-Agent Exploration

Never Give Up: Learning Directed Exploration Strategies

Exploration in Feature Space for Reinforcement Learning

Deep Bandits Show-Off: Simple and Efficient Exploration with Deep Networks

Fair Exploration via Axiomatic Bargaining