Abstract: Efficiently analyzing geo-distributed datasets is emerging as a major demand in cloud-edge systems. Because datasets are often generated close to end users, prior work mainly focuses on offloading suitable tasks from hotspot edges to the datacenter to reduce the overall completion time of submitted jobs in a one-shot manner. However, optimizing the completion time of the current job alone is insufficient over the long term, since some datasets are used multiple times. Optimizing the data distribution instead is much more efficient and directly benefits forthcoming jobs, although it may postpone the execution of the current one. Unfortunately, because of the throwaway nature of the data fetcher, existing data analytics systems fail to re-distribute the corresponding data out of hotspot edges after the analytics have executed. To minimize the overall completion time for a sequence of jobs while guaranteeing the performance of the current one, we propose to re-distribute data along with task offloading, and we formulate the corresponding $\varepsilon$-bounded data-driven task scheduling problem over a wide area network, taking edge heterogeneity into account. We design an online scheme, runData, which offloads suitable tasks and their related data to the datacenter via piggybacking, based on carefully calculated probabilities. Through rigorous theoretical analysis, runData is proved to concentrate on its optimum with high probability. We implement runData on Spark and HDFS. Both testbed results and trace-driven simulations show that runData re-distributes suitable data via piggybacking and achieves up to a 37 percent reduction in average response time compared with state-of-the-art schemes.

Network-Adaptive Scheduling of Data-Intensive Parallel Jobs with Dependencies in Clusters

Performance-Driven Task and Data Co-scheduling Algorithms for Data-Intensive Applications in Grid Computing

A Novel Job Scheduling Model to Enhance Efficiency and Overall User Fairness of Cloud Computing Environment

A Deadline-Aware Coflow Scheduling Approach for Big Data Applications

Communication-Efficient Task Scheduling for Real-Time Distributed Computing

Co-Scheduler: A Coflow-Aware Data-Parallel Job Scheduler in Hybrid Electrical/Optical Datacenter Networks

Adaptive priority-based data placement and multi-task scheduling in geo-distributed cloud systems

Scheduling Distributed Deep Learning Jobs in Heterogeneous Cluster with Placement Awareness

A Near Optimal Multi-Faced Job Scheduler For Datacenter Workloads

Distributed Bottleneck-Aware Coflow Scheduling in Data Centers

Online Job Scheduling in Distributed Machine Learning Clusters

Hypergraph-partitioning-based online joint scheduling of tasks and data

GPU Cluster Scheduling for Network-Sensitive Deep Learning

DRESS: Dynamic RESource-Reservation Scheme for Congested Data-Intensive Computing Platforms

Network-Aware Locality Scheduling for Distributed Data Operators in Data Centers

Preemptive and Low Latency Datacenter Scheduling via Lightweight Containers

Do the Hard Stuff First: Scheduling Dependent Computations in Data-Analytics Clusters

CloudCoaster: Transient-aware Bursty Datacenter Workload Scheduling

An Optimal Network-Aware Scheduling Technique for Distributed Deep Learning in Distributed HPC Platforms

Optimization of Big Data Parallel Scheduling Based on Dynamic Clustering Scheduling Algorithm