Abstract: As MapReduce becomes ubiquitous in large-scale data analysis, many recent studies have shown that its performance can be improved by different job scheduling approaches, e.g., Fair Scheduler and Capacity Scheduler. However, most existing MapReduce job schedulers assume that the MapReduce cluster is stable and pay little attention to clusters with dynamic resource availability. In fact, MapReduce cluster resources may fluctuate, because a growing number of Hadoop clusters are deployed on hybrid systems, e.g., infrastructure powered by a mix of traditional and renewable energy, and cloud platforms hosting heterogeneous workloads. Thus, there is a growing need to provide predictable services to users who have strict requirements on job completion times in such dynamic environments. In this paper, we propose RDS, a Resource and Deadline-aware Hadoop job Scheduler that takes future resource availability into consideration when minimizing job deadline misses. We formulate the job scheduling problem as an online optimization problem and solve it using an efficient receding horizon control algorithm. To aid the control, we design a self-learning model to estimate job completion times. We further extend the design of the RDS scheduler to support flexible performance goals in various dynamic clusters. In particular, we use flexible deadline time bounds instead of a single fixed job completion deadline. We have implemented RDS in the open-source Hadoop implementation and evaluated it with various benchmark workloads. Experimental results show that RDS reduces the penalty of deadline misses by at least 36 percent and 10 percent compared with the Fair Scheduler and the Earliest Deadline First (EDF) scheduler, respectively. In a Hadoop cluster running partially on renewable energy, experimental results show that the green-power-based resource prediction approach further reduces the penalty of deadline misses by 16 percent compared with the Auto-Regressive Integrated Moving Average (ARIMA) prediction approach.
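
The receding horizon idea mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the RDS algorithm itself: the abstract does not give the optimization formulation, so the planner below is a simple greedy stand-in, and the names `Job`, `plan_window`, and `run`, the slot-interval work unit, and the assumption of a perfect resource forecast are all hypothetical. At each control interval the scheduler plans over a window of predicted slot availability, applies only the first interval of that plan, and then re-plans.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    remaining: int  # remaining work in slot-intervals (hypothetical unit)
    deadline: int   # interval index by which the job should be finished


def plan_window(jobs, now, forecast):
    """Return one {job name: slots} dict per interval of the forecast window.

    Greedy stand-in for the optimizer: jobs predicted to meet their deadline
    reserve slots first (earliest deadline first); jobs predicted to miss
    are served from whatever capacity is left over.
    """
    free = list(forecast)
    plan = [dict() for _ in forecast]

    def reserve(job, last):
        # Place the job's work in the earliest free slots of intervals [0, last).
        need = job.remaining
        for t in range(last):
            give = min(free[t], need)
            if give:
                plan[t][job.name] = plan[t].get(job.name, 0) + give
                free[t] -= give
                need -= give

    late = []
    for job in sorted(jobs, key=lambda j: j.deadline):
        # Window intervals that end no later than the job's deadline.
        last = max(0, min(len(free), job.deadline - now))
        if sum(free[:last]) >= job.remaining:
            reserve(job, last)
        else:
            late.append(job)
    for job in late:
        reserve(job, len(free))
    return plan


def run(jobs, availability, horizon):
    """Receding horizon loop: plan over the window, apply the first step, repeat."""
    finish = {}
    for now in range(len(availability)):
        active = [j for j in jobs if j.remaining > 0]
        if not active:
            break
        # Assumption: the forecast equals the true future availability.
        forecast = availability[now:now + horizon]
        step = plan_window(active, now, forecast)[0]
        for job in active:
            job.remaining -= step.get(job.name, 0)
            if job.remaining == 0:
                finish[job.name] = now + 1
    return finish


jobs = [Job("A", remaining=5, deadline=2), Job("B", remaining=3, deadline=3)]
finish = run(jobs, availability=[2, 1, 3, 2, 2], horizon=4)
print(finish)  # {'B': 2, 'A': 4}
```

In this toy run, job A cannot meet its deadline under any schedule, so the planner serves job B first and B finishes on time. A plain EDF order would serve A first and cause both jobs to miss, which is the kind of outcome that looking ahead at resource availability is meant to avoid.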

Stage Delay Scheduling: Speeding up DAG-style Data Analytics Jobs with Resource Interleaving

DAG-Aware Optimization for Geo-Distributed Data Analytics

Argus: Efficient Job Scheduling in RDMA-assisted Big Data Processing

Performance Improvement of DAG-Aware Task Scheduling Algorithms with Efficient Cache Management in Spark

A Real-Time Scheduling Strategy Based on Processing Framework of Hadoop

A Delay Scheduling Algorithm Based on History Time in Heterogeneous Environments

A Near Optimal Multi-Faced Job Scheduler For Datacenter Workloads

Do the Hard Stuff First: Scheduling Dependent Computations in Data-Analytics Clusters

Task Scheduling for Spark Applications with Data Affinity on Heterogeneous Clusters

Big Data Processing Workflows Oriented Real-Time Scheduling Algorithm using Task-Duplication in Geo-Distributed Clouds

Joint Scheduling of Tasks and Network Flows in Big Data Clusters

A DAG Refactor Based Automatic Execution Optimization Mechanism for Spark

Adaptive Scheduling Framework of Streaming Applications based on Resource Demand Prediction with Hybrid Algorithms

Scheduling Spark Tasks with Data Skew and Deadline Constraints

Resource and Deadline-Aware Job Scheduling in Dynamic Hadoop Clusters

A Two-Stage Scheduling Method for Deadline-Constrained Task in Cloud Computing

Study on Adaptive Delay Schedule Algorithm Based on Progress Control of Hadoop

Dag-Based Parallel Real Time Task Scheduling Algorithm On A Cluster

Deadline-Aware MapReduce Job Scheduling with Dynamic Resource Availability

Beamer: Stage-Aware Coflow Scheduling to Accelerate Hyper-Parameter Tuning in Deep Learning Clusters