LearnedSQLGen: Constraint-aware SQL Generation Using Reinforcement Learning
Lixi Zhang,Chengliang Chai,Xuanhe Zhou,Guoliang Li
DOI: https://doi.org/10.1145/3514221.3526155
2022-01-01
Abstract:Many database optimization problems, e.g., slow SQL diagnosis, database testing, optimizer tuning, require a large volume of SQL queries. Due to privacy issues, it is hard to obtain real SQL queries, and thus SQL generation is a very important task in database optimization. Existing SQL generation methods either randomly generate SQL queries or rely on human-crafted SQL templates to generate SQL queries, but they cannot meet various user specific requirements, e.g., slow SQL queries, SQL queries with large result sizes. To address this problem, this paper studies the problem of constraint-aware SQL generation, which, given a constraint (e.g., cardinality within [1k,2k]), generates SQL queries satisfying the constraint. This problem is rather challenging, because it is rather hard to capture the relationship from query constraint (e.g., cardinality and cost) to SQL queries and thus it is hard to guide a generation method to explore the SQL generation direction towards meeting the constraint. To address this challenge, we propose a reinforcement learning (RL) based framework LearnedSQLGen, for generating queries satisfying the constraint. LearnedSQLGen adopts an exploration-exploitation strategy that exploits the generation direction following the query constraint, which is learned from query execution feedback. We judiciously design the reward function in RL to guide the generation process accurately. We integrate a finite-state machine to generate valid SQL queries. Experimental results on three benchmarks showed that LearnedSQLGen significantly outperformed the baselines in terms of both accuracy (30% better) and efficiency (10-35 times).