Weighted Distinct Sampling: Cardinality Estimation for SPJ Queries

Yuan Qiu,Yilei Wang,Ke Yi,Feifei Li,Bin Wu,Chaoqun Zhan
DOI: https://doi.org/10.1145/3448016.3452821
2021-01-01
Abstract:SPJ (select-project-join) queries form the backbone of many SQL queries used in practice. Accurate cardinality estimation of these queries is thus an important problem, with applications in query optimization, approximate query processing, and data analytics. However, this problem has not been rigorously addressed in the literature, despite the fact that cardinality estimation techniques of the three relational operators, selection, projection, and join, have each been extensively studied (but not when used in combination) in the past 30+ years. The major technical difficulty is that (distinct) projection seems to be difficult to combine with the other two operators when it comes to cardinality estimation. In this paper, we give the first formal study of cardinality estimation for SP queries. While it was studied in a prior work in 2001, there is no guarantee on its optimality. We define a class of algorithms, which we call weighted distinct sampling, for estimating SP query sizes, and show how to find a near-optimal sampling strategy that is away from the optimum only by a lower order term. We then extend it to handling SPJ queries, giving the first non-trivial solution for SPJ cardinality estimation. We have also performed an extensive experimental evaluation to complement our theoretical findings.
What problem does this paper attempt to address?