Pessimistic Cardinality Estimation

Mahmoud Abo Khamis, Kyle Deeds, Dan Olteanu, Dan Suciu
2024-12-01
Abstract: Cardinality Estimation is the task of estimating the size of the output of a query without computing it, using only statistics on the input relations. Existing estimators try to return an unbiased estimate of the cardinality: this is notoriously difficult. A new class of estimators has been proposed recently, called "pessimistic estimators", which compute a guaranteed upper bound on the query output. Two recent advances have made pessimistic estimators practical. The first is the recent observation that degree sequences of the input relations can be used to compute query upper bounds. The second is a long line of theoretical results that have developed the use of information theoretic inequalities for query upper bounds. This paper is a short overview of pessimistic cardinality estimators, contrasting them with traditional estimators.
Databases, Information Theory
What problem does this paper attempt to address?
The problem this paper addresses is **Cardinality Estimation (CE): estimating the size of a query's output**. Specifically, the paper focuses on how to estimate the output size from statistics on the input relations, without actually computing the query result. Traditional estimation methods (such as density-based methods) have significant errors, especially on complex queries involving multiple joins and predicates. The paper therefore explores a recently proposed alternative: **Pessimistic Cardinality Estimation (PCE)**.

### Limitations of Traditional Methods

1. **Density-based Estimation**: These methods rely on assumptions of data uniformity and independence. In practice these assumptions often do not hold, which leads to large estimation errors.
2. **Sampling-based Estimation**: Although it can provide an unbiased estimate, it requires expensive data access and performs poorly for highly selective predicates (for example, in join operations).
3. **Machine-learning-based Estimation**: Although it can perform well on its training data, it is limited in practice by distribution drift, a large memory footprint, and the restricted types of queries it supports.

### Advantages of Pessimistic Estimation

The main feature of pessimistic estimation is that it provides a **guaranteed upper limit**: for every database instance that satisfies the given statistics, the size of the query output will not exceed this limit. This gives a theoretical guarantee that is useful in many scenarios, such as ensuring that a query does not exhaust memory, or bounding the server resources to allocate for a distributed query.

### Main Contributions of the Paper

1. **Introduction of Degree Sequences**: Uses the degree sequences of the input relations to calculate an upper limit on the query output. Degree sequences carry more information about the data distribution and can significantly improve the accuracy of the estimate.
2. **Application of Information Inequalities**: Further tightens the upper limit on the query output by using inequalities from information theory (such as Shannon inequalities).
3. **Chain Bound and Polymatroid Bound**: Describes two bounds, the Chain Bound and the Polymatroid Bound, each of which gives the tighter upper limit in different situations.

### Summary

This paper addresses the large errors and the lack of theoretical guarantees in existing methods for estimating query output size. By using pessimistic estimation, and in particular degree sequences and information inequalities, it provides an estimate of the query output size that is more accurate and comes with a theoretical guarantee. This helps improve the performance of query optimizers and also gives reliable guarantees for other application scenarios.
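
To make the degree-sequence idea concrete, here is a minimal sketch for a single two-way join. It is an illustration under stated assumptions, not the paper's implementation: the relations `R` and `S`, the helper names, and the toy data are all hypothetical. The bound pairs the i-th largest degree on one side with the i-th largest degree on the other; by the rearrangement inequality this sum can never be smaller than the true join size, and it is never larger than the simpler bound that uses only the maximum degree.

```python
from collections import Counter


def degree_sequence(relation, column):
    """Return the degrees of the values in `column`, sorted in descending order."""
    counts = Counter(row[column] for row in relation)
    return sorted(counts.values(), reverse=True)


def degree_sequence_bound(deg_r, deg_s):
    """Upper bound on the join size: multiply the i-th largest degrees pairwise."""
    return sum(a * b for a, b in zip(deg_r, deg_s))


def true_join_size(r, r_col, s, s_col):
    """Exact size of the equi-join, used here only to check the bounds."""
    r_counts = Counter(row[r_col] for row in r)
    s_counts = Counter(row[s_col] for row in s)
    return sum(n * s_counts[value] for value, n in r_counts.items())


# Hypothetical toy relations R(x, y) and S(y, z), joined on y.
R = [(1, "a"), (2, "a"), (3, "a"), (4, "b"), (5, "c")]
S = [("b", 10), ("b", 11), ("b", 12), ("a", 13), ("c", 14)]

deg_r = degree_sequence(R, 1)  # degrees of y in R
deg_s = degree_sequence(S, 0)  # degrees of y in S

actual = true_join_size(R, 1, S, 0)
ds_bound = degree_sequence_bound(deg_r, deg_s)
# Coarser bound that knows only the cardinalities and the maximum degrees.
max_degree_bound = min(len(R) * deg_s[0], len(S) * deg_r[0])

print(actual, ds_bound, max_degree_bound)
```

On this toy input the heavy value in `R` ("a") is not the heavy value in `S` ("b"), so the degree-sequence bound overestimates the true size, but it stays below the bound based on maximum degrees alone.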