Abstract: In this paper, we present WOHA, an efficient scheduling framework for deadline-aware Map-Reduce workflows. In data centers, complex backend data analysis often uses a workflow that contains tens or even hundreds of interdependent Map-Reduce jobs. Meeting the deadlines of these workflows is usually of crucial importance to businesses (for example, workflows tightly linked to time-sensitive advertisement placement optimizations can directly affect revenue). Popular Map-Reduce implementations, such as Hadoop, deal with independent Map-Reduce jobs rather than workflows of jobs. To simplify the process of submitting workflows, solutions like Oozie have emerged, which take a workflow configuration file as input and automatically submit its Hadoop jobs at the right time. This separation of information, in which Hadoop handles only resource allocation and Oozie handles only workflow topology, keeps the Hadoop master node from getting involved in complex workflow analysis, but it may unnecessarily lengthen workflow spans and thus cause more deadline misses. To address this problem while preserving the efficiency of the Hadoop master node, WOHA allows client nodes to locally generate scheduling plans, which the master node later uses as resource allocation hints. Under this framework design, we propose a novel scheduling algorithm that improves the deadline satisfaction ratio by dynamically assigning priorities among workflows based on their progress. We implement WOHA by extending Hadoop-1.2.1. Our experiments over an 80-server cluster show that WOHA increases the deadline satisfaction ratio by 10% compared to state-of-the-art solutions, and scales up to tens of thousands of concurrently running workflows.
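The idea of assigning priorities among workflows based on their progress can be illustrated with a minimal Python sketch. This is not WOHA's algorithm: the `Workflow` class, the linear progress plan, the `est_span` estimate, and the `pick_next` function are all hypothetical names and assumptions introduced here only for illustration. The sketch assumes each client supplies an estimated span as its plan, and the master gives the next free slot to the workflow that lags furthest behind that plan.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Workflow:
    # All fields are illustrative assumptions, not WOHA's actual data model.
    name: str
    deadline: float      # absolute deadline, in seconds
    total_tasks: int     # number of tasks across all jobs in the workflow
    est_span: float      # client-side estimate of the time the workflow needs
    done_tasks: int = 0  # tasks finished so far

    def required_progress(self, now: float) -> float:
        """Fraction of tasks that should be done by `now` under a linear plan."""
        if self.est_span <= 0:
            return 1.0
        latest_start = self.deadline - self.est_span
        fraction = (now - latest_start) / self.est_span
        return min(1.0, max(0.0, fraction))

    def lag(self, now: float) -> float:
        """How far actual progress trails the plan (positive means behind)."""
        return self.required_progress(now) - self.done_tasks / self.total_tasks


def pick_next(workflows: List[Workflow], now: float) -> Optional[Workflow]:
    """Return the unfinished workflow that is furthest behind its plan."""
    pending = [w for w in workflows if w.done_tasks < w.total_tasks]
    return max(pending, key=lambda w: w.lag(now), default=None)
```

In this sketch the master only compares precomputed numbers per workflow, so the workflow analysis (here, producing `est_span`) stays on the client side, which mirrors the division of labor the abstract describes.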

A Real-Time Scheduling Strategy Based on Processing Framework of Hadoop

TATA: Throughput-Aware TAsk Placement in Heterogeneous Stream Processing with Deep Reinforcement Learning

Design and Implementation of a Real-time Scheduling Algorithm for MapReduce

Communication-Efficient Task Scheduling for Real-Time Distributed Computing

Performance optimization of computing task scheduling based on the Hadoop big data platform

A Task Scheduling Approach for Real-Time Stream Processing

An On-the-Fly Scheduling Strategy for Distributed Stream Processing Platform

Big Data Processing Workflows Oriented Real-Time Scheduling Algorithm using Task-Duplication in Geo-Distributed Clouds

New Scheduling Algorithms for Improving Performance and Resource Utilization in Hadoop YARN Clusters

Vhadoop: A Scalable Hadoop Virtual Cluster Platform for MapReduce-Based Parallel Machine Learning with Performance Consideration

Real-Time Big Data Processing Framework: Challenges and Solutions

Dynamic Task Scheduler for Real Time Requirement in Cloud Computing System

WOHA: Deadline-Aware Map-Reduce Workflow Scheduling Framework over Hadoop Clusters

Hadoop Scheduling Base On Data Locality

Adaptive Scheduling Framework of Streaming Applications based on Resource Demand Prediction with Hybrid Algorithms

Optimization of Big Data Parallel Scheduling Based on Dynamic Clustering Scheduling Algorithm

Research on the Framework and Resource Scheduling Mechanisms of Hadoop YARN

Optimizing Internal Overlaps by Self-Adjusting Resource Allocation in Multi-Stage Computing Systems

A Deadline-Aware Coflow Scheduling Approach for Big Data Applications

Improved Hungarian algorithm–based task scheduling optimization strategy for remote sensing big data processing

Improving Scheduling Efficiency of Hadoop YARN Using AFSA Algorithm