Abstract: Efficiently analyzing geo-distributed datasets is emerging as a major demand in cloud-edge systems. Since the datasets are often generated in close proximity to end users, existing works mainly focus on offloading suitable tasks from hotspot edges to the datacenter to decrease the overall completion time of submitted jobs in a one-shot manner. However, optimizing the completion time of the <i>current job</i> alone is insufficient in the long term, since some datasets are used multiple times. Instead, optimizing the data distribution is much more efficient and can directly benefit forthcoming jobs, although it may postpone the execution of the current one. Unfortunately, due to the throwaway nature of the data fetcher, existing data analytics systems fail to re-distribute the corresponding data out of hotspot edges after the analytics have executed. To minimize the overall completion time for a <i>sequence</i> of jobs while guaranteeing the performance of the current one, we propose to re-distribute the data along with task offloading, and we formulate the corresponding $\varepsilon$-bounded data-driven task scheduling problem over the wide area network, taking edge heterogeneity into account. We design an online scheme, <i>run</i>Data, which offloads suitable tasks and their related data to the datacenter via piggybacking, based on carefully calculated probabilities. Through rigorous theoretical analysis, we prove that <i>run</i>Data concentrates on its optimum with high probability. We implement <i>run</i>Data on top of Spark and HDFS. Both testbed results and trace-driven simulations show that <i>run</i>Data re-distributes suitable data via piggybacking and achieves up to a 37 percent reduction in average response time compared with state-of-the-art schemes.
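
The idea of probabilistic offloading with piggybacked data can be illustrated with a toy sketch. This is not the <i>run</i>Data algorithm itself: the abstract does not give the probability calculation or the exact meaning of the bound, so everything below is an assumption made for illustration. In particular, the `Task` fields, the function `offload_with_piggyback`, the serial load model for the edge and the datacenter, and the reading of the bound as "the current job may take at most (1 + eps) times a caller-supplied baseline" are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    block: str         # input data block the task reads (hypothetical field)
    compute_s: float   # compute time, assumed equal at the edge and the datacenter
    transfer_s: float  # time to ship the block over the WAN


def offload_with_piggyback(tasks, probs, baseline_s, eps, rng):
    """Randomly offload tasks; each offloaded task carries its block along.

    Assumption: an offload is accepted only while the estimated completion
    time of the current job stays within (1 + eps) * baseline_s.
    Returns (blocks moved to the datacenter, estimated completion time).
    """
    budget = (1.0 + eps) * baseline_s
    edge_load = sum(t.compute_s for t in tasks)  # every task starts at the edge
    dc_load = 0.0
    moved_blocks = []
    for task, p in zip(tasks, probs):
        if rng.random() >= p:
            continue  # the coin flip keeps this task at the edge
        new_edge = edge_load - task.compute_s
        new_dc = dc_load + task.transfer_s + task.compute_s
        # Toy model: both sites run serially, so the job ends when the slower one does.
        if max(new_edge, new_dc) <= budget:
            edge_load, dc_load = new_edge, new_dc
            moved_blocks.append(task.block)  # data travels with the task
    return moved_blocks, max(edge_load, dc_load)


if __name__ == "__main__":
    demo = [Task(f"t{i}", f"b{i}", 4.0, 1.0) for i in (1, 2, 3)]
    # Probability 1.0 forces an offload attempt for every task.
    print(offload_with_piggyback(demo, [1.0, 1.0, 1.0], 12.0, 0.0, random.Random(0)))
    # -> (['b1', 'b2'], 10.0)
```

In this toy run the third offload is rejected because it would push the datacenter load to 15 s, beyond the 12 s budget. The two blocks that did move stay at the datacenter, which is the sense in which later jobs over the same data could benefit.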

Run Data Run! Re-Distributing Data via Piggybacking for Geo-Distributed Data Analytics

<i>run</i>Data: Re-Distributing Data via Piggybacking for Geo-Distributed Data Analytics Over Edges

Performance-Driven Task and Data Co-scheduling Algorithms for Data-Intensive Applications in Grid Computing

Ran-Gjs

Adaptive priority-based data placement and multi-task scheduling in geo-distributed cloud systems

Workload-Aware Scheduling Across Geo-Distributed Data Centers

Cost-Efficient Task Scheduling for Geo-distributed Data Analytics

PingAn: An Insurance Scheme for Job Acceleration in Geo-distributed Big Data Analytics System

Load scheduling for distributed edge computing: A communication-computation tradeoff

Towards Reliable (and Efficient) Job Executions in a Practical Geo-distributed Data Analytics System

A MapReduce Cluster Deployment Optimization Framework with Geo-distributed Data

Towards Efficient Graph Processing in Geo-Distributed Data Centers

Big Data Processing Workflows Oriented Real-Time Scheduling Algorithm using Task-Duplication in Geo-Distributed Clouds

Data Based Application Partitioning and Workload Balance in Distributed Environment

GeoClone: Online Task Replication and Scheduling for Geo-Distributed Analytics under Uncertainties

Energy-Aware Cloud Workflow Applications Scheduling With Geo-Distributed Data

Efficient, Scalable and Robust Data Shuffle Service for Distributed MapReduce Computing on Cloud

Fault-tolerant scheduling and data placement for scientific workflow processing in geo-distributed clouds

Evaluating Data Redistribution in PaRSEC

Energy-efficient Analytics for Geographically Distributed Big Data