Abstract: MapReduce, the popular programming paradigm for large-scale data processing, has traditionally been deployed over tightly-coupled clusters where the data is already locally available. The assumption that the data and compute resources are available in a single central location, however, no longer holds for many emerging applications in commercial, scientific and social networking domains, where the data is generated in a geographically distributed manner. Further, the computational resources needed for carrying out the data analysis may be distributed across multiple data centers or community resources such as Grids. In this paper, we develop a modeling framework to capture MapReduce execution in a highly distributed environment comprising distributed data sources and distributed computational resources. This framework is flexible enough to capture several design choices and performance optimizations for MapReduce execution. We propose a model-driven optimization that has two key features: (i) it is end-to-end as opposed to myopic optimizations that may only make locally optimal but globally suboptimal decisions, and (ii) it can control multiple MapReduce phases to achieve low runtime, as opposed to single-phase optimizations that may control only individual phases. Our model results show that our optimization can provide nearly 82% and 64% reduction in execution time over myopic and single-phase optimizations, respectively. We have modified Hadoop to implement our model outputs, and using three different MapReduce applications over an 8-node emulated PlanetLab testbed, we show that our optimized Hadoop execution plan achieves 31-41% reduction in runtime over a vanilla Hadoop execution. Our model-driven optimization also provides several insights into the choice of techniques and execution parameters based on application and platform characteristics.

Query optimization for massively parallel data processing.

Join Query Optimization Based on MapReduce under Skewed Data

Optimization Factor Analysis Of Large-Scale Join Queries On Different Platforms

The performance of MapReduce: an in-depth study

The Performance of MapReduce

AQUA+: Query Optimization for Hybrid Database-MapReduce System.

Accelerating Apache Hive with MPI for Data Warehouse Systems

Distributed data management using MapReduce

Optimizing MapReduce for Highly Distributed Environments

Query grouping-based multi-query optimization framework for interactive SQL query engines on Hadoop.

MAP-JOIN-REDUCE: Toward Scalable and Efficient Data Analysis on Large Clusters

Large-scale Data Modelling in Hive and Distributed Query Processing using MapReduce and Tez

Super Rack: Reusing the Results of Queries in MapReduce Systems

Efficient Multi-way Theta-Join Processing Using MapReduce

MapReduce Job Optimization: A Mapping Study

Apache Hive: From MapReduce to Enterprise-grade Big Data Warehousing

Query and Resource Optimizations: A Case for Breaking the Wall in Big Data Systems

Vhadoop: A Scalable Hadoop Virtual Cluster Platform for MapReduce-Based Parallel Machine Learning with Performance Consideration

Cost-Based Optimization Of Logical Partitions For A Query Workload In A Hadoop Data Warehouse

Reusing the Results of Queries in MapReduce Systems by Adopting Shared Storage.

Optimization for Iterative Queries on MapReduce