Abstract: Deep reinforcement learning (RL) works impressively in some environments and fails catastrophically in others. Ideally, RL theory should be able to provide an understanding of why this is, i.e., bounds predictive of practical performance. Unfortunately, current theory does not quite have this ability. We compare standard deep RL algorithms to prior sample complexity bounds by introducing a new dataset, BRIDGE. It consists of 155 deterministic MDPs from common deep RL benchmarks, along with their corresponding tabular representations, which enables us to exactly compute instance-dependent bounds. We choose to focus on deterministic environments because they share many interesting properties of stochastic environments, but are easier to analyze. Using BRIDGE, we find that prior bounds do not correlate well with when deep RL succeeds vs. fails, but discover a surprising property that does. When actions with the highest Q-values under the random policy also have the highest Q-values under the optimal policy (i.e., when it is optimal to be greedy on the random policy's Q function), deep RL tends to succeed; when they don't, deep RL tends to fail. We generalize this property into a new complexity measure of an MDP that we call the effective horizon, which roughly corresponds to how many steps of lookahead search would be needed in that MDP in order to identify the next optimal action, when leaf nodes are evaluated with random rollouts. Using BRIDGE, we show that the effective horizon-based bounds are more closely reflective of the empirical performance of PPO and DQN than prior sample complexity bounds across four metrics. We also find that, unlike existing bounds, the effective horizon can predict the effects of using reward shaping or a pre-trained exploration policy. Our code and data are available at https://github.com/cassidylaidlaw/effective-horizon.
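
The property described in the abstract (being greedy on the random policy's Q function is optimal) can be checked directly on a small tabular MDP. The following is a minimal sketch, not the authors' code: it assumes a deterministic, finite-horizon MDP given as arrays `next_state[s, a]` and `reward[s, a]`, and the function names and the two toy MDPs are hypothetical illustrations. It covers only this one-step greedy property, not the full effective horizon measure.

```python
import numpy as np

def q_values(next_state, reward, horizon, random_policy):
    """Backward induction for a deterministic, finite-horizon tabular MDP.

    next_state[s, a] and reward[s, a] give the successor state and reward.
    Returns q[t, s, a], where future actions follow the uniform random
    policy if random_policy is True, and the optimal policy otherwise.
    """
    n_states, n_actions = reward.shape
    q = np.zeros((horizon, n_states, n_actions))
    value_next = np.zeros(n_states)  # value after the final step is zero
    for t in reversed(range(horizon)):
        q[t] = reward + value_next[next_state]
        # Average over actions for the random policy, maximize for the optimal one.
        value_next = q[t].mean(axis=1) if random_policy else q[t].max(axis=1)
    return q

def greedy_on_random_is_optimal(next_state, reward, horizon, tol=1e-9):
    """True if every action maximizing the random policy's Q also maximizes Q*."""
    q_rand = q_values(next_state, reward, horizon, random_policy=True)
    q_opt = q_values(next_state, reward, horizon, random_policy=False)
    best_rand = q_rand >= q_rand.max(axis=2, keepdims=True) - tol
    best_opt = q_opt >= q_opt.max(axis=2, keepdims=True) - tol
    return bool(np.all(best_opt[best_rand]))

# Hypothetical toy MDP A: moving right (action 1) leads to the reward.
next_state_a = np.array([[0, 1], [0, 2], [2, 2]])
reward_a = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]])

# Hypothetical toy MDP B: action 0 in state 0 pays 0.6 immediately and ends
# in an absorbing state, while action 1 pays 1.0 only if followed by the
# right action in state 1, so random rollouts undervalue it.
next_state_b = np.array([[2, 1], [2, 2], [2, 2]])
reward_b = np.array([[0.6, 0.0], [0.0, 1.0], [0.0, 0.0]])

holds_a = greedy_on_random_is_optimal(next_state_a, reward_a, horizon=2)
holds_b = greedy_on_random_is_optimal(next_state_b, reward_b, horizon=2)
```

In this sketch the check holds for MDP A and fails for MDP B: in B the random policy's Q-values prefer the immediate 0.6 reward (0.6 against 0.5), while the optimal Q-values prefer the delayed reward (1.0 against 0.6).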

Bridging RL Theory and Practice with the Effective Horizon