Abstract: Efficiently analyzing geo-distributed datasets is emerging as a major demand in cloud-edge systems. Since the datasets are often generated in close proximity to end users, traditional works mainly focus on offloading proper tasks from hotspot edges to the datacenter to decrease the overall completion time of submitted jobs in a one-shot manner. However, optimizing the completion time of the current job alone is insufficient over the long term, since some datasets are used multiple times. Instead, optimizing the data distribution is much more efficient and can directly benefit forthcoming jobs, although it may postpone the execution of the current one. Unfortunately, due to the throwaway nature of the data fetcher, existing data analytics systems fail to re-distribute the corresponding data out of hotspot edges after the execution of data analytics. In order to minimize the overall completion time for a sequence of jobs while guaranteeing the performance of the current one, we propose to re-distribute the data along with task offloading, and formulate the corresponding $\varepsilon$-bounded data-driven task scheduling problem over a wide area network under the consideration of edge heterogeneity. We design an online schema, $run$Data, which offloads proper tasks and related data via piggybacking to the datacenter based on delicately calculated probabilities. Through rigorous theoretical analysis, $run$Data is proved to concentrate on its optimum with high probability. We implement $run$Data based on Spark and HDFS. Both testbed results and trace-driven simulations show that $run$Data re-distributes proper data via piggybacking and achieves up to a 37 percent reduction in average response time compared with state-of-the-art schemas.
