Abstract: Data analysis is an important functionality in cloud computing, allowing huge amounts of data to be processed over very large clusters. MapReduce is recognized as a popular way to handle data in the cloud environment due to its excellent scalability and good fault tolerance. However, compared to parallel databases, MapReduce is slower when it is adopted to perform complex data analysis tasks that require joining multiple data sets in order to compute certain aggregates. A common concern is whether MapReduce can be improved to produce a system with both scalability and efficiency. In this paper, we introduce Map-Join-Reduce, a system that extends and improves the MapReduce runtime framework to efficiently process complex data analysis tasks on large clusters. We first propose a filtering-join-aggregation programming model, a natural extension of MapReduce's filtering-aggregation programming model. Then, we present a new data processing strategy which performs filtering-join-aggregation tasks in two successive MapReduce jobs. The first job applies filtering logic to all the data sets in parallel, joins the qualified tuples, and pushes the join results to the reducers for partial aggregation. The second job combines all partial aggregation results and produces the final answer. The advantage of our approach is that we join multiple data sets in one go and thus avoid frequent checkpointing and shuffling of intermediate results, a major performance bottleneck in most current MapReduce-based systems. We benchmark our system against Hive, a state-of-the-art MapReduce-based data warehouse, on a 100-node cluster on Amazon EC2 using the TPC-H benchmark. The results show that our approach significantly boosts the performance of complex analysis queries.
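
The two-job strategy described in the abstract can be illustrated with a minimal single-process sketch. This is not the Map-Join-Reduce implementation or its API: the function names, the record fields (`region`, `nation`, `amount`), the filter predicates, and the partitioning by order id are all assumptions chosen for illustration. The first function filters both inputs, joins the qualifying tuples, and keeps one partial aggregate per simulated reducer; the second merges the partial aggregates into the final answer.

```python
from collections import defaultdict


def filter_join_partial_aggregate(customers, orders, num_partitions=2):
    """Job 1 (sketch): filter both inputs, join them, aggregate per partition."""
    # Filtering logic; both predicates are hypothetical examples.
    kept_customers = {c["id"]: c for c in customers if c["region"] == "ASIA"}
    kept_orders = [o for o in orders if o["amount"] > 0]

    # One dictionary of partial sums per simulated reducer.
    partials = [defaultdict(float) for _ in range(num_partitions)]
    for order in kept_orders:
        customer = kept_customers.get(order["customer_id"])
        if customer is None:
            continue  # the join partner did not survive filtering
        # Assumed partitioning by order id, so the same group key
        # (nation) may appear on several partitions.
        partition = order["order_id"] % num_partitions
        partials[partition][customer["nation"]] += order["amount"]
    return partials


def combine_partials(partials):
    """Job 2 (sketch): merge the partial aggregates into the final result."""
    final = defaultdict(float)
    for partial in partials:
        for key, value in partial.items():
            final[key] += value
    return dict(final)
```

Because the group key can land on more than one partition in this sketch, the second step is needed to obtain correct totals; that mirrors the role the abstract assigns to the second job.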

Join Query Optimization Based on MapReduce under Skewed Data

Performance Evaluation for Distributed Join Based on MapReduce

Optimization Factor Analysis Of Large-Scale Join Queries On Different Platforms

Query optimization for massively parallel data processing

An Adaptive Skew Handling Join Algorithm for Large-scale Data Analysis

A cost aware adaptive multiple table join evaluation in MapReduce

MAP-JOIN-REDUCE: Toward Scalable and Efficient Data Analysis on Large Clusters

The performance of MapReduce: an in-depth study

The Performance of MapReduce

Effective Spatial Data Partitioning for Scalable Query Processing

Efficient Multi-way Theta-Join Processing Using MapReduce

An Effective High-Performance Multiway Spatial Join Algorithm with Spark

Improving Distributed Similarity Join in Metric Space with Error-bounded Sampling

An Efficient MapReduce Algorithm for Similarity Join in Metric Spaces

Utilizing the column imprints to accelerate no‐partitioning hash joins in large‐scale edge systems

A Real-Time Partition Generation Mechanism for Data Skew Mitigation in Spark Computing Environment

LIBRA: Lightweight Data Skew Mitigation in MapReduce

Scalable Parallel Join for Huge Tables

Cost-Based Optimization Of Logical Partitions For A Query Workload In A Hadoop Data Warehouse

Heads-Join: Efficient Earth Mover's Distance Similarity Joins on Hadoop

An adaptive skew insensitive join algorithm for large scale data analytics