Bridging RL Theory and Practice with the Effective Horizon

Cassidy Laidlaw, Stuart Russell, Anca Dragan
2024-01-12
Abstract: Deep reinforcement learning (RL) works impressively in some environments and fails catastrophically in others. Ideally, RL theory should be able to provide an understanding of why this is, i.e. bounds predictive of practical performance. Unfortunately, current theory does not quite have this ability. We compare standard deep RL algorithms to prior sample complexity bounds by introducing a new dataset, BRIDGE. It consists of 155 deterministic MDPs from common deep RL benchmarks, along with their corresponding tabular representations, which enables us to exactly compute instance-dependent bounds. We choose to focus on deterministic environments because they share many interesting properties of stochastic environments, but are easier to analyze. Using BRIDGE, we find that prior bounds do not correlate well with when deep RL succeeds vs. fails, but discover a surprising property that does. When actions with the highest Q-values under the random policy also have the highest Q-values under the optimal policy (i.e. when it is optimal to be greedy on the random policy's Q function), deep RL tends to succeed; when they don't, deep RL tends to fail. We generalize this property into a new complexity measure of an MDP that we call the effective horizon, which roughly corresponds to how many steps of lookahead search would be needed in that MDP in order to identify the next optimal action, when leaf nodes are evaluated with random rollouts. Using BRIDGE, we show that the effective horizon-based bounds are more closely reflective of the empirical performance of PPO and DQN than prior sample complexity bounds across four metrics. We also find that, unlike existing bounds, the effective horizon can predict the effects of using reward shaping or a pre-trained exploration policy. Our code and data are available at https://github.com/cassidylaidlaw/effective-horizon
Machine Learning
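The property described in the abstract, that it is optimal to be greedy on the random policy's Q function, can be checked exactly on a small tabular MDP. The following is a minimal sketch under stated assumptions, not the authors' code: the function names and the two toy MDPs (`EASY_*`, `HARD_*`) are hypothetical and are not taken from BRIDGE. It assumes a deterministic, finite-horizon MDP given as `next_state[s][a]` and `reward[s][a]`, and for simplicity it checks every state at every timestep, including states that may be unreachable.

```python
def q_values(next_state, reward, horizon, optimal):
    """Backward induction for a deterministic, finite-horizon tabular MDP.

    Returns q[t][s][a]. The continuation value is the max over actions when
    `optimal` is True (optimal policy) and the mean over actions otherwise
    (uniform random policy).
    """
    n_states, n_actions = len(next_state), len(next_state[0])
    value = [0.0] * n_states  # no reward is collected after the last step
    q = []
    for _ in range(horizon):
        q_t = [
            [reward[s][a] + value[next_state[s][a]] for a in range(n_actions)]
            for s in range(n_states)
        ]
        value = [max(row) if optimal else sum(row) / n_actions for row in q_t]
        q.insert(0, q_t)  # tables are built from the last timestep backward
    return q


def greedy_on_random_is_optimal(next_state, reward, horizon, tol=1e-9):
    """True if every argmax action of the random policy's Q is also optimal."""
    q_rand = q_values(next_state, reward, horizon, optimal=False)
    q_opt = q_values(next_state, reward, horizon, optimal=True)
    for q_rand_t, q_opt_t in zip(q_rand, q_opt):
        for rand_row, opt_row in zip(q_rand_t, q_opt_t):
            for q_r, q_o in zip(rand_row, opt_row):
                greedy_under_random = q_r >= max(rand_row) - tol
                suboptimal = q_o < max(opt_row) - tol
                if greedy_under_random and suboptimal:
                    return False
    return True


# Hypothetical toy MDP 1: action 1 always pays 1, so greedy-on-random is optimal.
EASY_NEXT = [[0, 1], [1, 2], [2, 2]]
EASY_REWARD = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]

# Hypothetical toy MDP 2: the payoff of 1 requires three correct actions in a
# row, and a wrong action along the way costs 1. State 3 is absorbing.
HARD_NEXT = [[3, 1], [3, 2], [3, 3], [3, 3]]
HARD_REWARD = [[0.2, 0.0], [-1.0, 0.0], [-1.0, 1.0], [0.0, 0.0]]

if __name__ == "__main__":
    print(greedy_on_random_is_optimal(EASY_NEXT, EASY_REWARD, horizon=3))  # True
    print(greedy_on_random_is_optimal(HARD_NEXT, HARD_REWARD, horizon=3))  # False
```

In the second toy MDP the random policy makes the long path look bad, because a random continuation usually hits the penalty, so being greedy on its Q-values picks the small immediate reward instead of the optimal path.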
What problem does this paper attempt to address?
This paper addresses the large gap between reinforcement learning (RL) theory and practice. Deep RL performs extremely well in some environments and fails completely in others. Ideally, RL theory would explain this by providing bounds that predict practical performance, but existing theory does not yet have this ability. Specifically, the paper focuses on the following problems:

1. **The gap between theory and practice**: Current RL theory cannot adequately explain why deep RL algorithms succeed in some environments but fail in others. The paper aims to narrow this gap by introducing a new complexity measure, the effective horizon.
2. **The predictive ability of sample complexity bounds**: Existing sample complexity bounds do not predict the practical performance of deep RL algorithms well. Using a new dataset, BRIDGE, the paper compares standard deep RL algorithms with prior sample complexity bounds and finds that those bounds do not correlate well with when deep RL succeeds or fails.
3. **Greedy actions under the random policy**: The paper finds that in many environments the action with the highest Q-value under the random policy is also the action with the highest Q-value under the optimal policy. When this holds, deep RL tends to succeed; when it does not, deep RL tends to fail.
4. **The definition and application of the effective horizon**: The paper proposes the effective horizon, a new measure of the complexity of an MDP. It roughly corresponds to how many steps of lookahead search are needed in an MDP to identify the next optimal action when leaf nodes are evaluated with random rollouts.

The paper shows that sample complexity bounds based on the effective horizon reflect the empirical performance of PPO and DQN more closely than prior bounds, and that the effective horizon can predict the effects of reward shaping or a pre-trained exploration policy.

### Key contributions of the paper

1. **BRIDGE dataset**: The paper creates BRIDGE, a dataset of 155 deterministic MDPs drawn from common deep RL benchmarks, along with their tabular representations. This makes it possible to compute instance-dependent bounds exactly.
2. **The concept of effective horizon**: The paper proposes a new complexity measure for MDPs, the effective horizon. It combines the depth of lookahead search with the number of random rollouts, and predicts the performance of deep RL algorithms better than existing measures.
3. **Theoretical and empirical analysis**: The paper supports the effective horizon with both theoretical analysis and experiments. The results show that sample complexity bounds based on the effective horizon reflect the empirical performance of PPO and DQN more closely than prior bounds.
4. **Understanding of random policies**: The paper finds that in many environments the action with the highest Q-value under the random policy is also the action with the highest Q-value under the optimal policy. This finding helps explain when deep RL succeeds and also suggests ideas for designing simpler RL algorithms.

### Conclusion

By introducing the effective horizon, the paper provides a new theoretical framework for understanding the performance of deep RL across environments. The effective horizon not only predicts the performance of deep RL algorithms better than prior bounds, but also helps explain why techniques such as reward shaping and pre-trained exploration policies can improve performance. These findings help bring RL theory and practice closer together.
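The idea of lookahead search with leaf nodes evaluated under the random policy can be illustrated on a toy MDP. The sketch below is a simplification and an assumption-laden illustration, not the paper's definition of the effective horizon: it scores leaves with the exact expected return of the uniform random policy (the limit of infinitely many rollouts) and so ignores the number of rollouts, which the effective horizon also accounts for. All function names and both toy MDPs are hypothetical. It reports the smallest lookahead depth `k` at which every chosen first action is optimal, at every state and timestep.

```python
from itertools import product


def state_values(next_state, reward, horizon, optimal):
    """Return v[t][s] by backward induction, with v[horizon][s] = 0.

    Uses the max over actions if `optimal`, else the uniform-random mean.
    """
    n_states, n_actions = len(next_state), len(next_state[0])
    v = [[0.0] * n_states for _ in range(horizon + 1)]
    for t in range(horizon - 1, -1, -1):
        for s in range(n_states):
            q = [reward[s][a] + v[t + 1][next_state[s][a]] for a in range(n_actions)]
            v[t][s] = max(q) if optimal else sum(q) / n_actions
    return v


def lookahead_actions(next_state, reward, horizon, v_rand, s0, t, k, tol=1e-9):
    """First actions of the best k-step plans; leaves scored by v_rand."""
    n_actions = len(next_state[0])
    depth = min(k, horizon - t)  # cannot look past the end of the episode
    best_by_first = {}
    for plan in product(range(n_actions), repeat=depth):
        s, total = s0, 0.0
        for a in plan:
            total += reward[s][a]
            s = next_state[s][a]
        total += v_rand[t + depth][s]  # random-policy value of the leaf
        previous = best_by_first.get(plan[0], float("-inf"))
        best_by_first[plan[0]] = max(previous, total)
    best = max(best_by_first.values())
    return {a for a, score in best_by_first.items() if score >= best - tol}


def min_lookahead_depth(next_state, reward, horizon, tol=1e-9):
    """Smallest k such that k-step lookahead only ever picks optimal actions."""
    n_states, n_actions = len(next_state), len(next_state[0])
    v_rand = state_values(next_state, reward, horizon, optimal=False)
    v_opt = state_values(next_state, reward, horizon, optimal=True)
    for k in range(1, horizon + 1):
        all_optimal = True
        for t in range(horizon):
            for s in range(n_states):
                optimal_actions = {
                    a for a in range(n_actions)
                    if reward[s][a] + v_opt[t + 1][next_state[s][a]] >= v_opt[t][s] - tol
                }
                chosen = lookahead_actions(next_state, reward, horizon, v_rand, s, t, k)
                if not chosen <= optimal_actions:
                    all_optimal = False
        if all_optimal:
            return k
    return horizon  # full-depth search is exact planning


# Hypothetical toy MDPs (not from BRIDGE); state 3 of the second is absorbing.
DENSE_NEXT = [[0, 1], [1, 2], [2, 2]]
DENSE_REWARD = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
SPARSE_NEXT = [[3, 1], [3, 2], [3, 3], [3, 3]]
SPARSE_REWARD = [[0.2, 0.0], [-1.0, 0.0], [-1.0, 1.0], [0.0, 0.0]]

if __name__ == "__main__":
    print(min_lookahead_depth(DENSE_NEXT, DENSE_REWARD, horizon=3))    # 1
    print(min_lookahead_depth(SPARSE_NEXT, SPARSE_REWARD, horizon=3))  # 3
```

A depth of 1 corresponds to being greedy on the random policy's Q function. In the sparse toy MDP, the search has to reach the full horizon before the rewarding path becomes visible.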