JoinGym: An Efficient Query Optimization Environment for Reinforcement Learning

Kaiwen Wang,Junxiong Wang,Yueying Li,Nathan Kallus,Immanuel Trummer,Wen Sun
2023-10-18
Abstract:Join order selection (JOS) is the problem of ordering join operations to minimize total query execution cost and it is the core NP-hard combinatorial optimization problem of query optimization. In this paper, we present JoinGym, a lightweight and easy-to-use query optimization environment for reinforcement learning (RL) that captures both the left-deep and bushy variants of the JOS problem. Compared to existing query optimization environments, the key advantages of JoinGym are usability and significantly higher throughput which we accomplish by simulating query executions entirely offline. Under the hood, JoinGym simulates a query plan's cost by looking up intermediate result cardinalities from a pre-computed dataset. We release a novel cardinality dataset for $3300$ SQL queries based on real IMDb workloads which may be of independent interest, e.g., for cardinality estimation. Finally, we extensively benchmark four RL algorithms and find that their cost distributions are heavy-tailed, which motivates future work in risk-sensitive RL. In sum, JoinGym enables users to rapidly prototype RL algorithms on realistic database problems without needing to setup and run live systems.
Machine Learning
What problem does this paper attempt to address?
This paper attempts to address the Join Order Selection (JOS) problem in database query optimization. Specifically, JOS is a combinatorial optimization problem with the goal of finding the join operation sequence with the lowest cost during query execution to minimize the total query execution cost. This problem is NP-hard, making it very challenging in practical applications. ### Main Contributions: 1. **Lightweight Simulator**: The paper introduces **JOINGYM**, a lightweight and easy-to-use query optimization environment specifically designed for Reinforcement Learning (RL). Compared to existing query optimization environments, the main advantages of JOINGYM are higher usability and significantly higher throughput, achieved by fully offline simulating query execution. 2. **Novel Dataset**: The authors released a dataset containing intermediate result (IR) cardinalities for 3300 SQL queries based on real IMDb workloads. This dataset is not only used for JOINGYM but can also be independently used for research such as cardinality estimation. 3. **Benchmarking**: The paper extensively benchmarks four RL algorithms and finds that their cost distributions are long-tailed, motivating future research in risk-sensitive RL. ### Problems Addressed: - **Long-Tail Return Distribution**: The cost distribution of the JOS problem is typically long-tailed, meaning that most queries have low costs, but a few have very high costs. This poses a challenge for the generalization ability of RL algorithms. - **Generalization in Combinatorial Optimization**: The JOS problem is a discrete combinatorial optimization problem, and effectively generalizing across different query instances is a key challenge. - **Partial Observability**: In practical database query optimization, query embeddings are lossy compressions of the underlying table contents, making it impossible to fully determine the cardinalities of intermediate results. This partial observability increases the complexity of the problem. ### Implementation Methods: - **Offline Simulation**: JOINGYM simulates the cost of query plans by looking up precomputed intermediate result cardinalities instead of executing queries online. This approach allows JOINGYM to simulate thousands of queries per second on a personal laptop, whereas traditional online methods might take hours or even days. - **Partially Observable Contextual Markov Decision Process (POCMDP)**: The JOS problem is modeled as a POCMDP, where the design of states, actions, and rewards takes into account the characteristics of partial observability and combinatorial optimization. ### Conclusion: JOINGYM provides an efficient and realistic environment that enables researchers to quickly prototype RL algorithms without the need to set up and run real systems, thereby accelerating research progress in the intersection of data systems and ML&RL.