Abstract: Distributed infrastructures for computation and analytics are now evolving towards an interconnected ecosystem allowing complex scientific workflows to be executed across hybrid systems spanning from IoT Edge devices to Clouds, and sometimes to supercomputers (the Computing Continuum). Understanding the performance trade-offs of large-scale workflows deployed on such a complex Edge-to-Cloud Continuum is challenging. To achieve this, one needs to perform experiments systematically, make them reproducible, and allow other researchers to replicate the study and its conclusions on different infrastructures. This comes down to the tedious process of reconciling the numerous experimental requirements and constraints with low-level infrastructure design choices. To address the limitations of the main state-of-the-art approaches for distributed, collaborative experimentation, such as Google Colab, Kaggle, and Code Ocean, we propose KheOps, a collaborative environment specifically designed to enable cost-effective reproducibility and replicability of Edge-to-Cloud experiments. KheOps is composed of three core elements: (1) an experiment repository; (2) a notebook environment; and (3) a multi-platform experiment methodology. We illustrate KheOps with a real-life Edge-to-Cloud application. The evaluations explore the point of view of the authors of an experiment described in an article (who aim to make their experiments reproducible) and the perspective of their readers (who aim to replicate the experiment). The results show how KheOps helps authors to systematically perform repeatable and reproducible experiments on the Grid5000 + FIT IoT LAB testbeds.
Furthermore, KheOps helps readers to cost-effectively replicate the authors' experiments on different infrastructures, such as the Chameleon Cloud + CHI@Edge testbeds, and obtain the same conclusions with high accuracy (> 88% for all performance metrics).

Reproducible Experiments for Comparing Apache Flink and Apache Spark on Public Clouds

A Benchmarking Study to Evaluate Apache Spark on Large-Scale Supercomputers

Benchmarking Distributed Stream Data Processing Systems

A Comprehensive Benchmarking Analysis of Fault Recovery in Stream Processing Frameworks

ShuffleBench: A Benchmark for Large-Scale Data Shuffling Operations with Distributed Stream Processing Frameworks

Apache Spark Streaming, Kafka and HarmonicIO: A Performance Benchmark and Architecture Comparison for Enterprise and Scientific Computing

Benchmarking scalability of stream processing frameworks deployed as microservices in the cloud

Technical Report: On the Usability of Hadoop MapReduce, Apache Spark & Apache Flink for Data Science

Evaluation of distributed data processing frameworks in hybrid clouds

Architectural Impact on Performance of In-memory Data Analytics: Apache Spark Case Study

Reproducible and Portable Big Data Analytics in the Cloud

ESPBench: The Enterprise Stream Processing Benchmark

Exploring Real-Time Data Processing Using Big Data Frameworks

Characterizing BigBench queries, Hive, and Spark in multi-cloud environments

Comparative Study on MapReduce and Spark for Big Data Analytics

Performance Analysis of Distributed Computing Frameworks for Big Data Analytics: Hadoop Vs Spark

A Comparative Study of Spark on the bare metal and Kubernetes

TR-Spark

KheOps: Cost-effective Repeatability, Reproducibility, and Replicability of Edge-to-Cloud Experiments

Design and implementation of reconfigurable acceleration for in-memory distributed big data computing

Efficient Fuzz Testing for Apache Spark Using Framework Abstraction