Abstract: Deep reinforcement learning (RL) works impressively in some environments and fails catastrophically in others. Ideally, RL theory should be able to provide an understanding of why this is, i.e. bounds predictive of practical performance. Unfortunately, current theory does not quite have this ability. We compare standard deep RL algorithms to prior sample complexity bounds by introducing a new dataset, BRIDGE. It consists of 155 deterministic MDPs from common deep RL benchmarks, along with their corresponding tabular representations, which enables us to exactly compute instance-dependent bounds. We choose to focus on deterministic environments because they share many interesting properties of stochastic environments, but are easier to analyze. Using BRIDGE, we find that prior bounds do not correlate well with when deep RL succeeds vs. fails, but discover a surprising property that does. When actions with the highest Q-values under the random policy also have the highest Q-values under the optimal policy (i.e. when it is optimal to be greedy on the random policy's Q function), deep RL tends to succeed; when they don't, deep RL tends to fail. We generalize this property into a new complexity measure of an MDP that we call the effective horizon, which roughly corresponds to how many steps of lookahead search would be needed in that MDP in order to identify the next optimal action, when leaf nodes are evaluated with random rollouts. Using BRIDGE, we show that the effective horizon-based bounds are more closely reflective of the empirical performance of PPO and DQN than prior sample complexity bounds across four metrics. We also find that, unlike existing bounds, the effective horizon can predict the effects of using reward shaping or a pre-trained exploration policy. Our code and data are available at https://github.com/cassidylaidlaw/effective-horizon
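
The property described in the abstract (being greedy on the random policy's Q function is optimal) can be checked exactly when a deterministic MDP is available in tabular form. The following is a minimal sketch under stated assumptions, not the authors' code: the function names, the finite-horizon setup, and the two toy MDPs are hypothetical, and the check is applied at every timestep and state, which is a stricter simplification than restricting to reachable states.

```python
import numpy as np

def q_values(transitions, rewards, horizon, policy="random"):
    """Backward induction for a deterministic, finite-horizon tabular MDP.

    transitions[s, a] is the next state index and rewards[s, a] the reward.
    Returns an array of shape (horizon, num_states, num_actions).
    """
    num_states, num_actions = transitions.shape
    q = np.zeros((horizon, num_states, num_actions))
    value_next = np.zeros(num_states)  # nothing is earned after the last step
    for t in reversed(range(horizon)):
        q[t] = rewards + value_next[transitions]
        if policy == "random":
            value_next = q[t].mean(axis=1)  # uniform random policy
        else:
            value_next = q[t].max(axis=1)   # optimal policy
    return q

def greedy_on_random_is_optimal(transitions, rewards, horizon, tol=1e-9):
    """True if every action that maximizes Q^random also maximizes Q^*."""
    q_rand = q_values(transitions, rewards, horizon, "random")
    q_opt = q_values(transitions, rewards, horizon, "optimal")
    greedy = q_rand >= q_rand.max(axis=2, keepdims=True) - tol
    optimal = q_opt >= q_opt.max(axis=2, keepdims=True) - tol
    return bool(np.all(optimal[greedy]))

# Hypothetical 3-state, 2-action toy MDPs (not taken from BRIDGE).
# "chain": moving right eventually pays off, and random rollouts reveal that.
chain_T = np.array([[0, 1], [1, 2], [2, 2]])
chain_R = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
# "trap": the best path passes through a state where a random action is costly.
trap_T = np.array([[2, 1], [2, 2], [2, 2]])
trap_R = np.array([[0.1, 0.0], [1.0, -2.0], [0.0, 0.0]])

print(greedy_on_random_is_optimal(chain_T, chain_R, horizon=2))  # True
print(greedy_on_random_is_optimal(trap_T, trap_R, horizon=2))    # False
```

In the second toy MDP the random policy undervalues the action that leads to the high-reward state, so acting greedily on its Q-values picks the small immediate reward instead of the optimal action.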

What problem does this paper attempt to address?

The main problem this paper addresses is the gap between current reinforcement learning (RL) theory and practice. Deep RL performs extremely well in some environments and fails completely in others. Ideally, RL theory would explain this by providing bounds that predict practical performance, but existing theory does not yet have this ability. Specifically, the paper focuses on the following problems:

1. **The gap between theory and practice**: Current RL theory cannot explain why deep RL algorithms succeed in some environments but fail in others. The paper aims to narrow this gap by introducing a new complexity measure, the effective horizon.
2. **The predictive ability of sample complexity bounds**: Existing sample complexity bounds do not predict the actual performance of deep RL algorithms well. Using a new dataset, BRIDGE, the paper compares standard deep RL algorithms with prior sample complexity bounds and finds that those bounds do not correlate well with when deep RL succeeds or fails.
3. **The role of the random policy**: The paper finds that deep RL tends to succeed when the actions with the highest Q-values under the random policy also have the highest Q-values under the optimal policy, and tends to fail otherwise. This property is central to the paper's explanation of the success and failure of deep RL.
4. **The definition and application of the effective horizon**: The paper generalizes this property into the effective horizon, a new measure of the complexity of an MDP. It roughly corresponds to how many steps of lookahead search are needed in an MDP to identify the next optimal action when leaf nodes are evaluated with random rollouts.

The paper shows that sample complexity bounds based on the effective horizon reflect the empirical performance of PPO and DQN more closely than prior bounds, and that the effective horizon can predict the effects of reward shaping or a pre-trained exploration policy.

### Key contributions of the paper

1. **BRIDGE dataset**: The paper introduces BRIDGE, a dataset of 155 deterministic MDPs drawn from common deep RL benchmarks, together with their tabular representations. This makes it possible to compute instance-dependent bounds exactly.
2. **The concept of the effective horizon**: The paper proposes the effective horizon as a new complexity measure of an MDP. It combines the depth of lookahead search and the number of random rollouts, and predicts the performance of deep RL algorithms better than prior measures.
3. **Theoretical and empirical analysis**: The paper supports the effective horizon with both theoretical analysis and experiments. The results show that sample complexity bounds based on the effective horizon reflect the empirical performance of PPO and DQN more closely than prior bounds.
4. **Understanding of the random policy**: The observation that acting greedily on the random policy's Q-function is often optimal helps explain when deep RL succeeds, and also suggests ideas for designing simpler RL algorithms.

### Conclusion

By introducing the effective horizon, the paper provides a new theoretical framework for understanding the performance of deep RL across environments. The effective horizon predicts the performance of deep RL algorithms better than prior bounds, and it can also predict the effects of techniques such as reward shaping and pre-trained exploration policies. These findings help bring RL theory and practice closer together.
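
The idea behind the effective horizon, lookahead search whose leaf nodes are scored by random rollouts, can be illustrated with a short sketch. This is a hedged illustration under stated assumptions rather than the paper's algorithm: `lookahead_action`, `rollout_return`, the `step` interface, and the toy MDP are all hypothetical, and the search simply enumerates every k-step action sequence in a deterministic environment.

```python
import itertools
import random

def rollout_return(step, state, steps_left, num_actions, rng):
    """Total reward from following uniformly random actions until the episode ends."""
    total = 0.0
    for _ in range(steps_left):
        state, reward = step(state, rng.randrange(num_actions))
        total += reward
    return total

def lookahead_action(step, state, steps_left, num_actions, k, num_rollouts, rng):
    """Choose the next action by k-step exhaustive lookahead.

    Each k-step action sequence is scored by its exact reward along the
    sequence plus the mean return of random rollouts from the leaf state.
    """
    if steps_left < 1:
        raise ValueError("no steps left in the episode")
    depth = min(k, steps_left)
    best_score, best_action = float("-inf"), None
    for seq in itertools.product(range(num_actions), repeat=depth):
        s, prefix_return = state, 0.0
        for a in seq:
            s, r = step(s, a)
            prefix_return += r
        remaining = steps_left - depth
        leaf_value = sum(
            rollout_return(step, s, remaining, num_actions, rng)
            for _ in range(num_rollouts)
        ) / num_rollouts
        score = prefix_return + leaf_value
        if score > best_score:
            best_score, best_action = score, seq[0]
    return best_action

# Hypothetical deterministic toy MDP with 3 states and 2 actions, horizon 2.
# From state 0, action 0 pays 0.1 and ends in the absorbing state 2;
# action 1 pays nothing now but reaches state 1, where the rewards are +1 or -2.
TRANSITIONS = [[2, 1], [2, 2], [2, 2]]
REWARDS = [[0.1, 0.0], [1.0, -2.0], [0.0, 0.0]]

def step(state, action):
    return TRANSITIONS[state][action], REWARDS[state][action]

rng = random.Random(0)
one_step = lookahead_action(step, 0, 2, 2, k=1, num_rollouts=200, rng=rng)
two_step = lookahead_action(step, 0, 2, 2, k=2, num_rollouts=200, rng=rng)
print(one_step, two_step)
```

In this toy example one step of lookahead is misled, because random rollouts from state 1 average roughly -0.5, while two steps of lookahead see the +1 reward exactly and select action 1. Environments that need more lookahead before random rollouts become informative are, in the summary's terms, the ones with a larger effective horizon.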

Bridging RL Theory and Practice with the Effective Horizon
