Abstract: Efficiently analyzing geo-distributed datasets is emerging as a major demand in cloud-edge systems. Since datasets are often generated close to end users, prior work mainly focuses on offloading suitable tasks from hotspot edges to the datacenter to decrease the overall completion time of submitted jobs in a one-shot manner. However, optimizing the completion time of the current job alone is insufficient in the long term, since some datasets are used multiple times. Optimizing the data distribution instead is much more efficient and directly benefits forthcoming jobs, although it may postpone the execution of the current one. Unfortunately, due to the throwaway nature of the data fetcher, existing data analytics systems fail to re-distribute the corresponding data out of hotspot edges after a job executes. To minimize the overall completion time of a sequence of jobs while guaranteeing the performance of the current one, we propose to re-distribute data along with task offloading, and we formulate the corresponding $\varepsilon$-bounded data-driven task scheduling problem over a wide area network, taking edge heterogeneity into account. We design an online scheme, $run$Data, which offloads suitable tasks and their related data to the datacenter via piggybacking, based on carefully calculated probabilities. Through rigorous theoretical analysis, $run$Data is proved to concentrate around its optimum with high probability. We implement $run$Data based on Spark and HDFS. Both testbed results and trace-driven simulations show that $run$Data re-distributes suitable data via piggybacking and achieves up to a 37 percent reduction in average response time compared with state-of-the-art schemes.
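
The abstract does not state how the offloading probabilities are computed or what the scheduling rule looks like, so the following is only a minimal Python sketch of the general idea: a task is offloaded with some probability, and the data it needs is kept at the datacenter so that later jobs can reuse it. The names `Site`, `schedule_job`, and `offload_prob` are hypothetical, and the probabilities are supplied by hand rather than derived as in the paper.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Site:
    """A compute site (edge or datacenter) and the dataset blocks it holds."""
    name: str
    blocks: set = field(default_factory=set)


def schedule_job(tasks, edge, datacenter, offload_prob, rng):
    """Place each task of one job and piggyback data for offloaded tasks.

    tasks        -- list of (task_id, block_id) pairs; each task reads one block
    offload_prob -- hypothetical map: block_id -> probability of offloading
    Returns a dict: task_id -> name of the site that runs the task.
    """
    placement = {}
    for task_id, block_id in tasks:
        if block_id in datacenter.blocks:
            # An earlier job already left a copy at the datacenter: reuse it.
            placement[task_id] = datacenter.name
        elif rng.random() < offload_prob.get(block_id, 0.0):
            # Offload the task and keep the fetched block at the datacenter
            # instead of discarding it once the job finishes.
            datacenter.blocks.add(block_id)
            placement[task_id] = datacenter.name
        else:
            # Run the task where its data already lives.
            placement[task_id] = edge.name
    return placement


# Two jobs in sequence; probabilities of 1.0 and 0.0 make the run deterministic.
rng = random.Random(0)
edge = Site("edge", {"b1", "b2"})
dc = Site("datacenter")
first = schedule_job([("t1", "b1"), ("t2", "b2")], edge, dc,
                     {"b1": 1.0, "b2": 0.0}, rng)
second = schedule_job([("t3", "b1")], edge, dc, {"b1": 0.0}, rng)
```

In this toy run, the first job moves block `b1` to the datacenter along with task `t1`, so the second job's task `t3` runs at the datacenter even though its own offloading probability is zero. That is the sense in which re-distributed data benefits forthcoming jobs.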
