Abstract: MapReduce has emerged as an important distributed programming paradigm for large-scale data analysis. As an open-source implementation of MapReduce, Hadoop is an attractive platform for many enterprises. A traditional Hadoop cluster deployed on a large number of physical machines has drawbacks, such as burdensome cluster management and fluctuating resource utilization. A virtualized Hadoop cluster not only simplifies cluster management but also enables cost-effective workload consolidation that improves resource utilization. In Hadoop, data locality is a critical factor in the performance of MapReduce applications. However, existing task scheduling approaches for improving data locality in virtualized Hadoop are not effective, because data is distributed at two levels: virtual machines and physical servers. In this paper, we deploy a virtualized Hadoop cluster in which computing nodes and storage nodes are placed in separate virtual machines to improve flexibility. We propose a novel task scheduling approach that improves data locality in a virtualized Hadoop cluster by migrating the virtual machine acting as a computing node to the physical server that runs a virtual machine acting as a storage node and holding a data replica needed by that computing node. We evaluated the efficiency of our approach on a virtualized Hadoop cluster with this deployment, comprising 11 computing nodes and 12 storage nodes. Our experimental results show that the approach improves, to varying degrees, the performance of 86% of the typical MapReduce applications in our benchmark suite.
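
The migration idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's scheduler: the function name `pick_migration_target`, the dictionaries `vm_host` and `block_replicas`, and the "first replica host wins" rule are all assumptions made for illustration. A real scheduler would presumably also weigh migration cost and host capacity, which this sketch ignores.

```python
from typing import Dict, List, Optional


def pick_migration_target(
    compute_vm: str,
    block: str,
    vm_host: Dict[str, str],
    block_replicas: Dict[str, List[str]],
) -> Optional[str]:
    """Return the physical host a computing VM could migrate to, or None.

    vm_host maps every VM (computing or storage) to its physical server.
    block_replicas maps a data block to the storage VMs holding a replica.
    Both structures are hypothetical stand-ins for cluster metadata.
    """
    current_host = vm_host[compute_vm]
    # Physical servers that run a storage VM holding a replica of the block.
    replica_hosts = [
        vm_host[storage_vm]
        for storage_vm in block_replicas.get(block, [])
        if storage_vm in vm_host
    ]
    # No known replica, or the data is already local at the server level.
    if not replica_hosts or current_host in replica_hosts:
        return None
    # Assumption: take the first candidate; no cost or capacity model.
    return replica_hosts[0]
```

With a computing VM on one server and replicas on two others, the function returns the first server that hosts a replica; when the computing VM already shares a server with a storage VM holding the block, it returns `None` and no migration is needed.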

A virtual machine based task scheduling approach to improving data locality for virtualized Hadoop
