Abstract: Efficiently analyzing geo-distributed datasets is emerging as a major demand in cloud-edge systems. Since the datasets are often generated in close proximity to end users, traditional works mainly focus on offloading proper tasks from hotspot edges to the datacenter to decrease the overall completion time of submitted jobs in a one-shot manner. However, optimizing the completion time of the current job alone is insufficient in the long term, since some datasets are used multiple times. Instead, optimizing the data distribution is much more efficient and could directly benefit forthcoming jobs, although it may postpone the execution of the current one. Unfortunately, due to the throwaway nature of the data fetcher, existing data analytics systems fail to re-distribute the corresponding data out of hotspot edges after the execution of data analytics. In order to minimize the overall completion time for a sequence of jobs while guaranteeing the performance of the current one, we propose to re-distribute the data along with task offloading, and formulate the corresponding $\varepsilon$-bounded data-driven task scheduling problem over a wide area network under the consideration of edge heterogeneity. We design an online scheme, runData, which offloads proper tasks and related data via piggybacking to the datacenter based on carefully calculated probabilities.
Through rigorous theoretical analysis, runData is proved to concentrate around its optimum with high probability. We implement runData on top of Spark and HDFS. Both testbed results and trace-driven simulations show that runData re-distributes proper data via piggybacking and achieves up to a 37 percent reduction in average response time compared with state-of-the-art schemes.

SLDP: A Novel Data Placement Strategy for Large-Scale Heterogeneous Hadoop Cluster

LDPP: A Learned Directory Placement Policy in Distributed File Systems

Distributed Affinity Propagation Clustering Based on MapReduce

An Optimized Learning-Based Directory Placement Policy with Two-Rounds Selection in Distributed File Systems

A Request Skew Aware Heterogeneous Distributed Storage System Based on Cassandra

Vhadoop: A Scalable Hadoop Virtual Cluster Platform for MapReduce-Based Parallel Machine Learning with Performance Consideration

A Holistic Heterogeneity-Aware Data Placement Scheme for Hybrid Parallel I/O Systems

A Novel Data Placement Strategy for Data-Sharing Scientific Workflows in Heterogeneous Edge-Cloud Computing Environments

Location-Aware Data Block Allocation Strategy for HDFS-Based Applications in the Cloud

SLAS: An efficient approach to scaling round-robin striped volumes

Data placement in distributed data centers for improved SLA and network cost

Optimal Data Placement for Data-Sharing Scientific Workflows in Heterogeneous Edge-Cloud Computing Environments

Optimizing Hadoop Block Placement Policy and Cluster Blocks Distribution

Swarm Intelligence with a Chaotic Leader and a Salp algorithm: HDFS optimization for reduced latency and enhanced availability

Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture

Optimizing Data Partition for Scaling out Nosql Cluster

Application and Storage-Aware Data Placement and Job Scheduling for Hadoop Clusters

Distributed Data Placement via Graph Partitioning

A MapReduce Cluster Deployment Optimization Framework with Geo-distributed Data

Efficient, Scalable and Robust Data Shuffle Service for Distributed MapReduce Computing on Cloud