Abstract: As large-scale data analytics becomes the norm in various industries, the use of MapReduce frameworks to analyze ever-increasing volumes of data will keep growing. In turn, this trend increases the interest in moving MapReduce into multi-tenant clouds. However, the performance of MapReduce applications can be significantly affected by the time-varying network bandwidth in a shared cluster. Although many recent studies improve MapReduce performance through dynamic scheduling that reduces shuffle traffic, most of them do not consider the impact of the hierarchical network architectures that are widely deployed in data centers. In this article, we propose and design a hierarchical topology (Hit) aware MapReduce scheduler to minimize the overall data traffic cost and hence reduce job execution time. We first formulate the problem as a Topology Aware Assignment (TAA) optimization problem that accounts for dynamic computing and communication resources in a cloud with a hierarchical network architecture. We then develop a synergistic strategy that solves the TAA problem using stable matching theory, which respects the preferences of both individual tasks and hosting machines. Finally, we implement the proposed scheduler as a pluggable module on Hadoop YARN and evaluate its performance through testbed experiments and simulations. The testbed results show that Hit-scheduler can improve job completion time by 28 and 11 percent compared with Capacity Scheduler and the Probabilistic Network-Aware scheduler, respectively. Our simulations further demonstrate that Hit-scheduler can reduce the traffic cost by up to 38 percent and the average shuffle flow traffic time by 32 percent compared with Capacity Scheduler. In this article, we have also extended Hit-scheduler with a decentralized heuristic scheme that performs policy-aware allocation in data center environments.
Many existing centralized approximation approaches are too complex to implement in a data center, which typically includes large numbers of servers, containers, switches, and traffic flows. In the extension, we design a decentralized heuristic scheme that performs Policy-Aware Task (PAT) allocation, using an existing centralized algorithm to approximately maximize the total gained utility. Finally, simulation-based experimental results show that the proposed PAT policy reduces the communication cost by 33.6 percent compared with the default scheduler in data centers.
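
To illustrate the stable matching idea mentioned in the abstract, the following is a minimal sketch of task-proposing deferred acceptance with machine capacities. It is not the Hit-scheduler's algorithm: the function name `stable_assign`, the preference lists, and the capacities are all hypothetical, and the sketch assumes every machine ranks every task. In a topology-aware setting, a task might rank machines by network distance to its input data, but that ranking is taken as given here.

```python
from collections import deque

def stable_assign(task_prefs, machine_prefs, capacity):
    """Task-proposing deferred acceptance with per-machine capacities.

    task_prefs:    task -> list of machines, most preferred first (hypothetical input)
    machine_prefs: machine -> list of tasks, most preferred first (assumed complete)
    capacity:      machine -> number of tasks it can host
    Returns a dict mapping each assigned task to its machine.
    """
    # rank[m][t] is the position of task t in machine m's list (lower is better).
    rank = {m: {t: i for i, t in enumerate(prefs)}
            for m, prefs in machine_prefs.items()}
    next_choice = {t: 0 for t in task_prefs}  # next machine index each task will try
    hosted = {m: [] for m in machine_prefs}   # tasks tentatively held by each machine
    free = deque(task_prefs)                  # tasks that still need a machine

    while free:
        t = free.popleft()
        if next_choice[t] >= len(task_prefs[t]):
            continue  # the task has been rejected everywhere and stays unassigned
        m = task_prefs[t][next_choice[t]]
        next_choice[t] += 1
        hosted[m].append(t)
        if len(hosted[m]) > capacity[m]:
            # Over capacity: the machine rejects the task it ranks lowest.
            worst = max(hosted[m], key=lambda x: rank[m][x])
            hosted[m].remove(worst)
            free.append(worst)

    return {t: m for m, tasks in hosted.items() for t in tasks}


# Hypothetical example: three tasks, two machines.
task_prefs = {"t1": ["A", "B"], "t2": ["A", "B"], "t3": ["A", "B"]}
machine_prefs = {"A": ["t2", "t1", "t3"], "B": ["t1", "t3", "t2"]}
capacity = {"A": 1, "B": 2}

assignment = stable_assign(task_prefs, machine_prefs, capacity)
print(assignment)
```

In this example all three tasks prefer machine A, which can host only one of them and keeps `t2`, the task it ranks highest. The other two tasks fall back to machine B, so no task and machine would both prefer each other over their final partners.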

Joint Optimization of MapReduce Scheduling and Network Policy in Hierarchical Data Centers