Abstract: A default assumption in the design of reinforcement-learning algorithms is that a decision-making agent always explores to learn optimal behavior. In sufficiently complex environments that approach the vastness and scale of the real world, however, attaining optimal performance may in fact be an entirely intractable endeavor, and an agent may seldom find itself in a position to complete the requisite exploration for identifying an optimal policy. Recent work has leveraged tools from information theory to design agents that deliberately forgo optimal solutions in favor of sufficiently satisfying, or satisficing, solutions obtained through lossy compression. Notably, such agents may employ fundamentally different exploratory decisions to learn satisficing behaviors more efficiently than optimal ones, which are more data-intensive. While supported by a rigorous corroborating theory, the underlying algorithm relies on model-based planning, drastically limiting the compatibility of these ideas with function approximation and high-dimensional observations. In this work, we remedy this issue by extending an agent that directly represents uncertainty over the optimal value function, allowing it both to bypass the need for model-based planning and to learn satisficing policies. We provide simple yet illustrative experiments that demonstrate how our algorithm enables deep reinforcement-learning agents to achieve satisficing behaviors. In keeping with previous work on this setting for multi-armed bandits, we additionally find that our algorithm is capable of synthesizing optimal behaviors, when feasible, more efficiently than its non-information-theoretic counterpart.

An Information-Theoretic Optimality Principle for Deep Reinforcement Learning

Deep Reinforcement Learning with Double Q-Learning

Data Efficient Deep Reinforcement Learning with Action-Ranked Temporal Difference Learning

Improve Value Estimation of Q Function and Reshape Reward with Monte Carlo Tree Search

State of the Art Control of Atari Games Using Shallow Reinforcement Learning

Using Deep Q-Learning to Control Optimization Hyperparameters

On Bellman's principle of optimality and Reinforcement learning for safety-constrained Markov decision process

Iterated $Q$-Network: Beyond One-Step Bellman Updates in Deep Reinforcement Learning

Information-Theoretic Confidence Bounds for Reinforcement Learning

Self Punishment and Reward Backfill for Deep Q-Learning

Learning Representations in Reinforcement Learning: An Information Bottleneck Approach

Expert Q-learning: Deep Reinforcement Learning with Coarse State Values from Offline Expert Examples

Satisficing Exploration for Deep Reinforcement Learning

Information-Theoretic Minimax Regret Bounds for Reinforcement Learning based on Duality

Implicitly Regularized RL with Implicit Q-Values

A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation

Deep Reinforcement Learning: A Convex Optimization Approach

Learning to Play in a Day: Faster Deep Reinforcement Learning by Optimality Tightening

Deep Q-Learning: Theoretical Insights from an Asymptotic Analysis

Mitigating Suboptimality of Deterministic Policy Gradients in Complex Q-functions