Abstract: Efficiently analyzing geo-distributed datasets is emerging as a major demand in cloud-edge systems. Since the datasets are often generated in close proximity to end users, prior work mainly focuses on offloading suitable tasks from hotspot edges to the datacenter to decrease the overall completion time of submitted jobs in a one-shot manner. However, optimizing the completion time of the current job alone is insufficient over the long term, since some datasets are used multiple times. Optimizing the data distribution instead is much more efficient and directly benefits forthcoming jobs, although it may postpone the execution of the current one. Unfortunately, due to the throwaway nature of the data fetcher, existing data analytics systems fail to re-distribute the corresponding data out of hotspot edges after the analytics have run. To minimize the overall completion time for a sequence of jobs while guaranteeing the performance of the current one, we propose to re-distribute the data along with task offloading, and formulate the corresponding $\varepsilon$-bounded data-driven task scheduling problem over a wide area network, taking edge heterogeneity into account. We design an online schema, $run$Data, which offloads suitable tasks and their related data to the datacenter via piggybacking, based on carefully calculated probabilities. Through rigorous theoretical analysis, $run$Data is proved to concentrate around its optimum with high probability. We implement $run$Data on top of Spark and HDFS. Both testbed results and trace-driven simulations show that $run$Data re-distributes suitable data via piggybacking and achieves up to a 37 percent reduction in average response time compared with state-of-the-art schemas.

R2D2: Reducing Redundancy and Duplication in Data Lakes

DROLAP - A Dense-Region Based Approach to On-Line Analytical Processing

Reproducible data science over data lakes: replayable data pipelines with Bauplan and Nessie

Towards Optimizing Storage Costs on the Cloud

LakeBench: Benchmarks for Data Discovery over Data Lakes

A Big Data Lake for Multilevel Streaming Analytics

The Data Lakehouse: Data Warehousing and More

Enhancing Dependability in Big Data Analytics Enterprise Pipelines

Retrieve, Merge, Predict: Augmenting Tables with Data Lakes

Scalable Architecture for Personalized Healthcare Service Recommendation using Big Data Lake

Integrating Data Lake Tables

A Review on Data Lake

Delta Tensor: Efficient Vector and Tensor Storage in Delta Lake

Benchmarking Data Lakes Featuring Structured and Unstructured Data with DLBench

Rethinking Storage Management for Data Processing Pipelines in Cloud Data Centers

Leveraging Oil and Gas Data Lakes to Enable Data Science Factories

Why TPC is Not Enough: An Analysis of the Amazon Redshift Fleet

Searching Data Lakes for Nested and Joined Data

SparkDWM: a scalable design of a Data Washing Machine using Apache Spark

Partition, Don't Sort! Compression Boosters for Cloud Data Ingestion Pipelines