Abstract: MapReduce, the popular programming paradigm for large-scale data processing, has traditionally been deployed over tightly-coupled clusters where the data is already locally available. The assumption that the data and compute resources are available in a single central location, however, no longer holds for many emerging applications in commercial, scientific and social networking domains, where the data is generated in a geographically distributed manner. Further, the computational resources needed for carrying out the data analysis may be distributed across multiple data centers or community resources such as Grids. In this paper, we develop a modeling framework to capture MapReduce execution in a highly distributed environment comprising distributed data sources and distributed computational resources. This framework is flexible enough to capture several design choices and performance optimizations for MapReduce execution. We propose a model-driven optimization that has two key features: (i) it is end-to-end, as opposed to myopic optimizations that may make locally optimal but globally suboptimal decisions, and (ii) it can control multiple MapReduce phases to achieve low runtime, as opposed to single-phase optimizations that control only individual phases. Our model results show that our optimization can provide nearly 82% and 64% reduction in execution time over myopic and single-phase optimizations, respectively. We have modified Hadoop to implement our model outputs, and using three different MapReduce applications over an 8-node emulated PlanetLab testbed, we show that our optimized Hadoop execution plan achieves a 31-41% reduction in runtime over a vanilla Hadoop execution. Our model-driven optimization also provides several insights into the choice of techniques and execution parameters based on application and platform characteristics.
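The contrast between a myopic and an end-to-end plan can be illustrated with a small toy model. The sketch below is not the paper's framework: the three-phase cost function (push, map, shuffle, each ending at a global barrier), the single reducer, the rule that each source sends all of its data to one mapper, and every number in it are assumptions chosen only for illustration. The myopic planner picks the fastest push link for each source in isolation, while the end-to-end planner enumerates every assignment and minimizes the total modeled runtime.

```python
from itertools import product

# Toy parameters (all hypothetical). Sizes in MB, rates in MB/s.
SOURCE_MB = [100.0, 100.0]   # data held at each source
PUSH_BW = [[10.0, 8.0],      # PUSH_BW[s][m]: link from source s to mapper m
           [10.0, 8.0]]
MAP_RATE = [5.0, 5.0]        # processing rate of each mapper
SHUFFLE_BW = [1.0, 10.0]     # link from mapper m to the single reducer
ALPHA = 1.0                  # assumed ratio of map output size to map input size


def makespan(assign):
    """Modeled runtime of a plan; assign[s] is the mapper chosen for source s.

    Assumes a global barrier after each phase, so every phase lasts as long
    as its slowest participant and the phase times add up.
    """
    n_mappers = len(MAP_RATE)
    load = [0.0] * n_mappers
    for s, m in enumerate(assign):
        load[m] += SOURCE_MB[s]
    push = max(SOURCE_MB[s] / PUSH_BW[s][m] for s, m in enumerate(assign))
    map_time = max(load[m] / MAP_RATE[m] for m in range(n_mappers))
    shuffle = max(ALPHA * load[m] / SHUFFLE_BW[m] for m in range(n_mappers))
    return push + map_time + shuffle


def myopic_plan():
    """Each source independently picks its fastest push link."""
    n_mappers = len(MAP_RATE)
    return tuple(
        max(range(n_mappers), key=lambda m: PUSH_BW[s][m])
        for s in range(len(SOURCE_MB))
    )


def end_to_end_plan():
    """Enumerate every assignment and keep the one with the lowest makespan."""
    n_mappers = len(MAP_RATE)
    candidates = product(range(n_mappers), repeat=len(SOURCE_MB))
    return min(candidates, key=makespan)


if __name__ == "__main__":
    for name, plan in (("myopic", myopic_plan()), ("end-to-end", end_to_end_plan())):
        print(f"{name}: plan={plan}, modeled runtime={makespan(plan)} s")
```

In this toy instance the myopic planner sends both sources to mapper 0 because its push links are faster, and then pays for that mapper's slow shuffle link; the end-to-end planner accepts a slightly slower push in exchange for a much faster shuffle. Exhaustive enumeration is used here only because the example is tiny; it says nothing about how the paper solves its optimization.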

Optimizing MapReduce for Highly Distributed Environments
