Grid-AR: A Grid-based Booster for Learned Cardinality Estimation and Range Joins

Damjan Gjurovski, Angjela Davitkova, Sebastian Michel
2024-10-10
Abstract: We propose an advancement in cardinality estimation by augmenting autoregressive models with a traditional grid structure. The novel hybrid estimator addresses the limitations of autoregressive models by creating a smaller representation of continuous columns and by incorporating a batch execution for queries with range predicates, as opposed to an iterative sampling approach. The suggested modification markedly improves the execution time of the model for both training and prediction, reduces memory consumption, and does so with minimal decline in accuracy. We further present an algorithm that enables the estimator to calculate cardinality estimates for range join queries efficiently. To validate the effectiveness of our cardinality estimator, we conduct and present a comprehensive evaluation considering state-of-the-art competitors using three benchmark datasets, demonstrating vast improvements in execution times and resource utilization.
Databases
### What problem does this paper attempt to address?

This paper addresses cardinality estimation in relational databases, with a particular focus on the challenges posed by range join queries. Specifically:

1. **Limitations of existing methods**:
   - **Histogram and sampling methods**: Traditional histogram-based or sampling-based cardinality estimation techniques struggle to provide accurate estimates for complex queries or skewed data distributions.
   - **Autoregressive models**:
     - **Long execution time**: For queries containing range predicates, autoregressive models rely on progressive sampling, which leads to long execution times.
     - **High memory consumption**: Autoregressive models require a large amount of memory during training and deployment, especially for columns with many distinct values.
     - **Poor adaptability**: These models usually adapt poorly to query distributions outside the training data.
2. **Proposed method**:
   - **Grid-AR structure**: The authors propose a hybrid structure, Grid-AR, which combines a traditional grid structure with an autoregressive model. By replacing continuous attributes with grid cells, Grid-AR reduces the size and memory consumption of the model while increasing query execution speed.
   - **Efficient processing of range join queries**: Beyond single-table queries, Grid-AR comes with an algorithm designed specifically for range join queries, which reduces execution time and resource consumption while maintaining high accuracy.
3. **Specific goals**:
   - **Fast pre-filtering**: Pre-filter the data through the grid structure to reduce the number of tuples that must be considered, thereby accelerating query processing.
   - **Avoid iterative sampling**: Replace the continuous attributes in the original data with the pre-filtered partitions, bypassing the iterative sampling process of the autoregressive model and improving execution speed.
   - **Reduce memory footprint**: Avoid storing a large number of embeddings and dictionary mappings, especially for numerical columns, to reduce overall memory consumption.
4. **Contributions**:
   - A new cardinality estimator, Grid-AR, which combines grid and autoregressive structures.
   - An efficient single-table query algorithm that reduces memory overhead through grid-structure pre-filtering.
   - An algorithm for estimating the cardinality of range join queries.
   - A comprehensive experimental evaluation on one synthetic dataset and two real-world datasets.

In summary, this paper aims to improve cardinality estimation, especially for range join queries. By introducing the Grid-AR structure, it addresses the execution-time and memory limitations of existing autoregressive estimators while incurring only a minimal decline in accuracy.
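The grid pre-filtering idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `UniformGrid` and its methods are hypothetical names, the grid is a plain equal-width grid, and the per-cell tuple counts stand in for the autoregressive model, which in Grid-AR would presumably supply the per-cell estimate (together with predicates on the remaining columns). Tuples are also assumed to be uniformly distributed within a cell. The point is only to show how a range predicate can be answered in one pass over the overlapping cells rather than by iterative sampling.

```python
from collections import Counter
from itertools import product


class UniformGrid:
    """Equal-width grid over continuous columns (hypothetical helper)."""

    def __init__(self, rows, buckets_per_dim):
        dims = len(rows[0])
        self.buckets = buckets_per_dim
        self.lo = [min(r[d] for r in rows) for d in range(dims)]
        hi = [max(r[d] for r in rows) for d in range(dims)]
        # Cell width per dimension; fall back to 1.0 for constant columns.
        self.width = [((hi[d] - self.lo[d]) / buckets_per_dim) or 1.0
                      for d in range(dims)]
        # Tuples per cell. Assumption: this count is a stand-in for the
        # estimate a learned model would return for the cell.
        self.counts = Counter(self.cell_of(r) for r in rows)

    def _bucket(self, value, d):
        """Bucket index of `value` in dimension d, clamped to the grid."""
        idx = int((value - self.lo[d]) / self.width[d])
        return min(max(idx, 0), self.buckets - 1)

    def cell_of(self, row):
        """Cell id (tuple of bucket indices) that replaces the raw values."""
        return tuple(self._bucket(v, d) for d, v in enumerate(row))

    def estimate(self, ranges):
        """Estimate the rows matching one inclusive (low, high) per dimension."""
        # Pre-filter: enumerate only the cells that can overlap the query.
        per_dim = [range(self._bucket(lo, d), self._bucket(hi, d) + 1)
                   for d, (lo, hi) in enumerate(ranges)]
        total = 0.0
        for cell in product(*per_dim):
            count = self.counts.get(cell, 0)
            if count == 0:
                continue
            # Fraction of the cell's volume covered by the query box.
            fraction = 1.0
            for d, (lo, hi) in enumerate(ranges):
                cell_lo = self.lo[d] + cell[d] * self.width[d]
                cell_hi = cell_lo + self.width[d]
                overlap = min(hi, cell_hi) - max(lo, cell_lo)
                fraction *= max(overlap, 0.0) / self.width[d]
            total += count * fraction
        return total


# Usage: 100 points on a regular lattice, 3 buckets per dimension.
rows = [(x + 0.5, y + 0.5) for x in range(10) for y in range(10)]
grid = UniformGrid(rows, buckets_per_dim=3)
print(grid.estimate([(0.5, 5.0), (0.5, 9.5)]))
```

The estimate for a partially covered cell is only as good as the uniformity assumption, so a finer grid trades memory for accuracy. A query that fully covers the data returns the exact row count, because every cell then contributes its whole count.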