Abstract: MapReduce has been widely used for large-scale data analysis in the cloud. The system is well recognized for its elastic scalability and fine-grained fault tolerance, although its performance has been noted to be suboptimal in the database context. According to a recent study [19], Hadoop, an open source implementation of MapReduce, is slower than two state-of-the-art parallel database systems by a factor of 3.1 to 6.5 across a variety of analytical tasks. MapReduce can achieve better performance by allocating more compute nodes from the cloud to speed up computation; however, this approach of "renting more nodes" is not cost effective in a pay-as-you-go environment. Users want an economical, elastically scalable data processing system, and are therefore interested in whether MapReduce can offer both elastic scalability and efficiency. In this paper, we conduct a performance study of MapReduce (Hadoop) on a 100-node Amazon EC2 cluster with various levels of parallelism. We identify five design factors that affect the performance of Hadoop and investigate alternative but known methods for each factor. We show that by carefully tuning these factors, the overall performance of Hadoop can be improved by a factor of 2.5 to 3.5 on the same benchmark used in [19], making it more comparable to that of parallel database systems. Our results therefore show that it is possible to build a cloud data processing system that is both elastically scalable and efficient.

Micro-blogs Data Collection Based on MapReduce

A Distributed Data Mining System Framework for Mobile Internet Access Log Based on Hadoop

Large scale microblog mining using distributed MB-LDA

A System to Manage and Mine Microblogging Data

Analysis of Plant Breeding on Hadoop and Spark

Large-Scale Social Network Analysis Based on MapReduce

A Public Opinion Monitoring System Based on Big Data Technology

Collecting, Managing and Analyzing Social Networking Data Effectively

Parallel Approach and Platform for Large-Scale WEB Data Extraction

Design and Implementation of Microblog Graph data Management and Storage Platform Based on NoSQL

Vhadoop: A Scalable Hadoop Virtual Cluster Platform for MapReduce-Based Parallel Machine Learning with Performance Consideration

Real-Time Search over a Microblogging System

Research on method for extracting large-scale social network based on Mapreduce

The performance of MapReduce: an in-depth study

The grand information flows in micro-blog

Research on GIS Massive Traffic Data Analysis Platform Based on Hadoop

A parallel clustering algorithm for logs data based on Hadoop platform

Mavis: A Multiple Microblogs Analysis And Visualization Tool

Evaluating Large Graph Processing in MapReduce Based on Message Passing

Design and Research of Web Crawler Based on Distributed Architecture

MicroScholar: Mining Scholarly Information from Chinese Microblogs