Abstract: As more enterprises and organizations outsource their IT services to distributed clouds to save costs, the historical and operational data generated by these services grows exponentially and is usually stored in data centers at different geographic locations in the distributed cloud. Such data, referred to as big data, has become an invaluable asset to many businesses and organizations, as it can reveal business advantages and inform strategic decisions. Big data analytics has thus emerged as a main research topic in distributed cloud computing. Query evaluation for big data analytics poses two challenges: (i) the cloud resource demands of a query typically exceed what any single data center can supply and so span multiple data centers, and (ii) the source data of the query is located at different data centers. Together these create heavy data traffic among the data centers in the distributed cloud, resulting in high communication costs. A fundamental question for query evaluation in big data analytics is therefore how to admit as many such queries as possible while keeping the accumulative communication cost minimized. In this paper, we investigate this question by formulating an online query evaluation problem for big data analytics in distributed clouds, with the objective of maximizing the query acceptance ratio while minimizing the accumulative communication cost of query evaluation. We first propose a novel metric model that captures the utilization of different resources at data centers by incorporating resource workloads and the resource demands of each query. We then devise an efficient online algorithm. Finally, we conduct extensive simulation experiments to evaluate the performance of the proposed algorithm. Experimental results demonstrate that the proposed algorithm is promising and outperforms other heuristics.

Efficient Skew Handling in Online Aggregation in the Cloud

An Efficient Block Sampling Strategy For Online Aggregation In The Cloud

Processing Online Aggregation on Skewed Data in MapReduce

A Request Skew Aware Heterogeneous Distributed Storage System Based on Cassandra

Improving online aggregation performance for skewed data distribution

You can stop early with COLA: online processing of aggregate queries in the cloud

Needle in a Haystack: Max/Min Online Aggregation in the Cloud

Research on Data Skew Join Algorithm Based on MapReduce Model

Sae: Toward Efficient Cloud Data Analysis Service for Large-Scale Social Networks

Partition-Based Online Aggregation with Shared Sampling in the Cloud

OATS: online aggregation with two-level sharing strategy in cloud

COLA: A cloud-based system for online aggregation

Handling Data Skew for Aggregation in Spark SQL Using Task Stealing

Online Aggregation: A Review

An Adaptive Skew Handling Join Algorithm for Large-scale Data Analysis

Resisting Skew-Accumulation for Time-Stepped Applications in the Cloud Via Exploiting Parallelism

Data Locality-Aware Query Evaluation for Big Data Analytics in Distributed Clouds

DS²: Handling Data Skew Using Data Stealings over High-Speed Networks

A Comparative Study of Data Skew in Hadoop

CrowdOLA: Online Aggregation on Duplicate Data Powered by Crowdsourcing

Ad-Hoc Aggregate Query Processing Algorithms Based on Bit-Store for Query Intensive Applications in Cloud Computing