Histogram-Based Estimation Techniques in Database Systems

V. Poosala
Abstract: Many commercial database management systems maintain histograms to summarize the contents of relations in order to efficiently estimate query result sizes and access plan costs. The accuracy of these estimates is often of critical importance. However, there has never been a systematic study of all aspects of histograms and their effectiveness in providing accurate estimates. In this thesis, we identify (theoretically and experimentally) the most accurate classes of histograms for estimating the sizes and distributions of the results of several important query operators, and we provide efficient (sampling-based) techniques to construct these histograms. All of these histograms are novel and differ in fundamental ways from traditional histograms. We provide a systematic classification of all classes of histograms based on certain canonical aspects of histograms that determine their effectiveness in a given estimation problem. We also provide techniques to capture dependencies between attributes in a relation and show that these techniques are far more accurate than the traditional attribute independence assumption. Finally, we use histograms to effectively balance load during parallel joins, thus demonstrating their versatility. Our overall conclusion from the accuracy and efficiency of histogram-based techniques is that they can be used both for enhanced accuracy in traditional applications (e.g., query optimizers) and in novel applications that can benefit from estimates (e.g., approximate query processors and load balancers).