Abstract: The MapReduce framework has become the de facto standard for big data processing due to its attractive features and abilities. One such feature is that it automatically parallelizes a job into multiple tasks and transparently handles task execution on a large cluster of commodity machines. The increasing heterogeneity of distributed environments may result in a few straggling tasks, which prolong job completion. Speculative execution was proposed to mitigate stragglers. However, the existing speculative execution mechanism does not work efficiently, because many speculative tasks are still slower than their original tasks. In this paper, we explore an approach to increase the efficiency of speculative execution and thereby further improve MapReduce performance. We propose the Partial Speculative Execution (PSE) strategy, in which a speculative task starts from a checkpoint of the original task rather than from the beginning. By leveraging the checkpoint of the original task, PSE eliminates the costs of re-reading, re-copying, and re-computing data that has already been processed. We implement PSE in Hadoop and evaluate its performance in terms of job completion time and the efficiency of speculative execution under several kinds of classical workloads. Experimental results show that, in heterogeneous environments with stragglers, PSE completes jobs 56% faster than with no speculation and 12% faster than with LATE, an improved speculative execution algorithm. In addition, PSE improves the efficiency of speculative execution by 24% on average compared to LATE.
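The idea of starting a speculative task from a checkpoint can be illustrated with a minimal sketch. This is not the paper's Hadoop implementation: the `Checkpoint` class, the `run_task` function, and the word-count workload are hypothetical names chosen for illustration, and the checkpoint is assumed to hold only an input offset and the partial output computed so far.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """Hypothetical checkpoint: how far the task got and what it produced."""
    offset: int                                   # input records already processed
    partial: dict = field(default_factory=dict)   # partial output so far


def run_task(records, start=0, partial=None, stop_at=None):
    """Word-count-style task over records[start:stop_at].

    Returns the resulting checkpoint and the number of records read,
    which stands in for the re-read / re-compute cost.
    """
    counts = dict(partial or {})
    end = len(records) if stop_at is None else stop_at
    records_read = 0
    for i in range(start, end):
        for word in records[i].split():
            counts[word] = counts.get(word, 0) + 1
        records_read += 1
    return Checkpoint(offset=end, partial=counts), records_read


records = ["a b", "b c", "c a", "a a"]

# The original (straggling) task checkpoints after three of four records.
ckpt, _ = run_task(records, stop_at=3)

# Conventional speculation: the backup task restarts from the beginning.
full, full_read = run_task(records)

# Checkpoint-based speculation: the backup task resumes at the checkpoint.
resumed, resumed_read = run_task(records, start=ckpt.offset, partial=ckpt.partial)

print(full_read, resumed_read)          # records read by each backup task
print(resumed.partial == full.partial)  # both produce the same output
```

In this toy setting the resumed task reads only the records after the checkpoint yet produces the same result as a full restart, which is the saving the abstract attributes to avoiding re-reading and re-computing processed data.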

An Executable Specification of Map-Join-Reduce Using Haskell.

Improving MapReduce Performance with Partial Speculative Execution

Performance Evaluation for Distributed Join Based on MapReduce.

Towards Formalizing of MapReduce.

A Modeling Language for MapReduce Programing in a Storage System Perspective.

Reliable Estimation of Execution Time of MapReduce Program

DataMPI: Extending MPI to Hadoop-Like Big Data Computing

An Analytical Performance Model of MapReduce

Generate, test, and aggregate: a calculation-based framework for systematic parallel programming with mapreduce

Accumulative Computation on MapReduce

LLMapReduce: Multi-Level Map-Reduce for High Performance Data Analysis

A Semantic++ MapReduce Parallel Programming Model.

Towards Systematic Parallel Programming Over Mapreduce

Map-Balance-Reduce: an Improved Parallel Programming Model for Load Balancing of MapReduce.

Query optimization for massively parallel data processing.

A Semantic++ MapReduce: A Preliminary Report

Uncoupled MapReduce: A Balanced and Efficient Data Transfer Model

A methodology for high-level software specification construction.

An Evolutionary Development Model Supporting Executable Specification

Filter-embedding Semiring Fusion for Programming with MapReduce

A Specification for Typed Template Haskell