Abstract: Efficiently analyzing geo-distributed datasets is emerging as a major demand in cloud-edge systems. Since the datasets are often generated in close proximity to end users, traditional works mainly focus on offloading proper tasks from hotspot edges to the datacenter to decrease the overall completion time of submitted jobs in a one-shot manner. However, optimizing the completion time of the <i>current job</i> alone is insufficient in the long term, since some datasets are used multiple times. Instead, optimizing the data distribution is much more efficient and can directly benefit forthcoming jobs, although it may postpone the execution of the current one. Unfortunately, due to the throwaway nature of the data fetcher, existing data analytics systems fail to re-distribute the corresponding data out of hotspot edges after the execution of data analytics. In order to minimize the overall completion time for a <i>sequence</i> of jobs while guaranteeing the performance of the current one, we propose to re-distribute the data along with task offloading, and formulate the corresponding $\varepsilon$-bounded data-driven task scheduling problem over a wide area network, taking edge heterogeneity into account. We design an online scheme, <i>run</i>Data, which offloads proper tasks and related data via piggybacking to the datacenter based on carefully calculated probabilities. Through rigorous theoretical analysis, <i>run</i>Data is proved to concentrate on its optimum with high probability. We implement <i>run</i>Data based on Spark and HDFS. Both testbed results and trace-driven simulations show that <i>run</i>Data re-distributes proper data via piggybacking and achieves up to a 37 percent reduction in average response time compared with state-of-the-art schemes.

<i>run</i>Data: Re-Distributing Data via Piggybacking for Geo-Distributed Data Analytics Over Edges