Abstract: Sampling-based approximate query processing lets users of 'Big Data' analytical applications save time and resources when an estimated result meets their accuracy expectation well before the exact result would be available. Online aggregation (OLA) is an attractive technique of this kind: it answers aggregation queries by computing approximate results whose confidence intervals tighten over time. It has been built into MapReduce-based cloud systems for big data analytics, allowing users to monitor query progress and save money by terminating the computation once sufficient accuracy has been reached. Unfortunately, a major obstacle remains: OLA estimation can fail when the sample set is biased, which violates the unbiased-sampling assumption of OLA, and this failure degrades OLA performance. To address this problem, we first propose a hybrid approximate query processing model that improves overall OLA performance. Its dynamic scheme switching mechanism switches unpromising OLA queries to a bootstrap scheme for further processing, avoiding the full dataset scan that an OLA estimation failure would otherwise cause. We also present a progressive estimation method to reduce the false positive ratio of the dynamic scheme switching mechanism. Finally, we have implemented our hybrid approximate query processing system in Hadoop and conducted extensive experiments on the TPC-H benchmark with skewed data distributions. Our results demonstrate that the hybrid system produces acceptable approximate results in a time one order of magnitude shorter than the original OLA over Hadoop.
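
To make the two estimation schemes named in the abstract concrete, the following is a minimal single-machine sketch, not the authors' Hadoop implementation. It assumes a mean (AVG) aggregate, a normal-approximation confidence interval for the OLA phase, and a simple sample budget as the trigger for falling back to a percentile bootstrap. The function names, the `rel_error` stopping rule, and the `max_samples` budget are all hypothetical choices made for illustration; the paper's actual switching criterion and progressive estimation method are not described here.

```python
import math
import random


def online_mean(stream, z=1.96, rel_error=0.05, min_samples=30, max_samples=10_000):
    """OLA-style running mean with a normal-approximation confidence interval.

    Returns (estimate, half_width, samples_seen, converged). The loop stops as
    soon as the CI half-width is within rel_error of the estimate, or when the
    (hypothetical) sample budget max_samples is exhausted.
    """
    seen = []
    mean, m2 = 0.0, 0.0  # Welford running mean and sum of squared deviations
    for x in stream:
        seen.append(x)
        n = len(seen)
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n >= min_samples:
            half = z * math.sqrt(m2 / (n - 1) / n)
            if half <= rel_error * abs(mean):
                return mean, half, seen, True
        if n >= max_samples:
            break
    n = len(seen)
    half = z * math.sqrt(m2 / (n - 1) / n) if n > 1 else float("inf")
    return mean, half, seen, False


def bootstrap_mean_ci(sample, resamples=1000, alpha=0.05, rng=None):
    """Percentile bootstrap confidence interval for the mean of a sample."""
    rng = rng or random.Random(0)
    n = len(sample)
    means = sorted(sum(rng.choices(sample, k=n)) / n for _ in range(resamples))
    lo = means[int(alpha / 2 * resamples)]
    hi = means[int((1 - alpha / 2) * resamples) - 1]
    return lo, hi


def hybrid_estimate(stream, **ola_kwargs):
    """Try the OLA estimate first; fall back to the bootstrap if it does not converge.

    Assumption: "unpromising" is modelled here as simply running out of budget.
    """
    mean, half, seen, converged = online_mean(stream, **ola_kwargs)
    if converged:
        return "ola", (mean - half, mean + half)
    return "bootstrap", bootstrap_mean_ci(seen)
```

A caller would pass any iterable of numeric values, for example `hybrid_estimate(iter(values), rel_error=0.01)`, and receive the name of the scheme that produced the interval together with its lower and upper bounds.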

Needle in a Haystack: Max/Min Online Aggregation in the Cloud

An Efficient Block Sampling Strategy For Online Aggregation In The Cloud

Efficient Skew Handling in Online Aggregation in the Cloud

You can stop early with COLA: online processing of aggregate queries in the cloud.

Secure Computation of Maximum and Minimum Values in data Aggregation Based on Cloud Computing

COLA: A cloud-based system for online aggregation

MinMax Sampling: A Near-optimal Global Summary for Aggregation in the Wide Area

Online Aggregation: A Review

Partition-Based Online Aggregation with Shared Sampling in the Cloud

Moving Big Data to The Cloud: An Online Cost-Minimizing Approach

Approximate Processing of Massive Continuous Quantile Queries over High-Speed Data Streams

Privacy-Enhanced And Multifunctional Health Data Aggregation Under Differential Privacy Guarantees

Adaptive Optimization With Max-Min Achievable Rate Fairness In Mobile Cloud Networking

Distributed Online Aggregations

Cost Minimization Method for Multi-Source Big Data Processing in Clouds

A Sampling-Based Hybrid Approximate Query Processing System in the Cloud

Approximate Query Based Online Aggregation with Group-by and Application

Improving online aggregation performance for skewed data distribution

Neighborhood-privacy Protected Shortest Distance Computing in Cloud.

Answering the Min-Cost Quality-Aware Query on Multi-sources in Sensor-Cloud Systems

HEDC++: An Extended Histogram Estimator for Data in the Cloud