Abstract: Online aggregation (OLA) is an attractive sampling-based technique that answers aggregation queries with an approximate estimate of the final result, together with a confidence interval that becomes tighter over time. It has been built into MapReduce-based cloud systems for big data analytics, allowing users to monitor query progress and to save money by killing the computation early once sufficient accuracy has been obtained. However, the performance of OLA is seriously limited when multiple OLA queries are processed concurrently, because the queries do not share work. In the original MapReduce paradigm, each query is processed independently without considering potential sharing opportunities, which leads to two major unnecessary execution costs: (1) a large redundant I/O cost, and (2) a duplicated statistical computation cost. To eliminate these costs and improve overall performance, this paper presents OATS, online aggregation with a two-level sharing strategy in the cloud, built on the MapReduce framework to effectively support online aggregation for large-scale concurrent query processing over skewed data distributions. In the first level of sharing, we propose a sample buffer management mechanism that shares sampling opportunities among multiple OLA queries to reduce the redundant I/O cost. In the second level of sharing, we propose a heuristic algorithm, which scales well to large inputs, that shares partial statistics calculations to decrease the number of final aggregation operations, thereby reducing the statistical computation cost. Based on this two-level sharing strategy, we have implemented OATS in Hadoop and conducted an extensive experimental study on the TPC-H benchmark with skewed data distributions. Our results demonstrate the efficiency and effectiveness of OATS.
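To make the idea of shared sampling concrete, the following is a minimal single-process sketch in Python. It is not the OATS implementation and does not use MapReduce or Hadoop. It only illustrates how one pass over randomly ordered rows can feed several concurrent AVG queries, each of which reports a running estimate and an approximate 95% confidence interval. The names `RunningMean`, `shared_online_avg`, `extractors`, `batch_size`, and `target` are hypothetical and chosen for illustration. The interval uses a plain normal approximation and ignores the finite-population correction, which is an assumption of this sketch.

```python
import math
import random


class RunningMean:
    """Welford-style running mean and variance for one query (illustrative)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def half_width(self, z=1.96):
        """Half-width of an approximate confidence interval (normal approximation)."""
        if self.n < 2:
            return math.inf
        variance = self.m2 / (self.n - 1)
        return z * math.sqrt(variance / self.n)


def shared_online_avg(rows, extractors, batch_size=100, target=None, seed=0):
    """Yield (rows_seen, {query: (estimate, half_width)}) after each batch.

    Each sampled row is read once and passed to every query, so the
    sampling cost is shared instead of being paid once per query.
    """
    rng = random.Random(seed)
    order = list(range(len(rows)))
    rng.shuffle(order)  # random order, so every prefix is a uniform sample
    stats = {name: RunningMean() for name in extractors}
    for start in range(0, len(order), batch_size):
        for i in order[start:start + batch_size]:
            row = rows[i]  # single read, shared by all queries
            for name, extract in extractors.items():
                stats[name].add(extract(row))
        snapshot = {name: (s.mean, s.half_width()) for name, s in stats.items()}
        yield min(start + batch_size, len(order)), snapshot
        # Stop early once every query is accurate enough.
        if target is not None and all(hw <= target for _, hw in snapshot.values()):
            return


if __name__ == "__main__":
    data = [{"price": float(i % 10), "qty": float(i % 3)} for i in range(1000)]
    queries = {"avg_price": lambda r: r["price"], "avg_qty": lambda r: r["qty"]}
    for seen, snap in shared_online_avg(data, queries, target=0.2):
        print(seen, snap)
```

With a `target` set, the generator stops as soon as every query's interval half-width falls below it, which mirrors the early-termination benefit described in the abstract. With `target=None`, it runs over all rows and the final estimates equal the exact averages.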

An Efficient Block Sampling Strategy For Online Aggregation In The Cloud

Partition-Based Online Aggregation with Shared Sampling in the Cloud

You can stop early with COLA: online processing of aggregate queries in the cloud.

Efficient Skew Handling in Online Aggregation in the Cloud

Needle in a Haystack: Max/Min Online Aggregation in the Cloud

Progressive online aggregation in a distributed stream system

COLA: A cloud-based system for online aggregation

Efficient block-based sampling algorithm for aggregation query processing on duplicate charged records

Comparative Studies of Sampling for Analytics on Massive Data

Improving online aggregation performance for skewed data distribution

Online Aggregation: A Review

Continuous Sampling for Online Aggregation over Multiple Queries

OATS: online aggregation with two-level sharing strategy in cloud

A Sampling-Based Hybrid Approximate Query Processing System in the Cloud

Distributed Online Aggregations

AQapprox: Aggregation Queries Approximation with Distribution-Aware Online Sampling.

Ad-Hoc Aggregate Query Processing Algorithms Based on Bit-Store for Query Intensive Applications in Cloud Computing

Processing online aggregation on skewed data in MapReduce.

Approximate Query Based Online Aggregation with Group-by and Application

SAQP++: Bridging the Gap Between Sampling-Based Approximate Query Processing and Aggregate Precomputation.

Location-Aware Data Block Allocation Strategy for HDFS-Based Applications in the Cloud