Abstract: Hadoop, especially Hadoop 2.0, has been a dominant framework for real-time big data processing. However, Hadoop is not optimized for energy efficiency. To address this problem, we propose a new framework that improves the energy efficiency of Hadoop 2.0. We focus on YARN, the resource manager in Hadoop 2.0, and propose energy-efficient task scheduling mechanisms on YARN. In particular, we focus on CPU-intensive streaming jobs and classify them into two types: batch streaming jobs (i.e., a set of jobs submitted simultaneously) and online streaming jobs (i.e., jobs submitted continuously, one by one). We devise a different energy-efficient task scheduling algorithm for each type. Specifically, we first propose an abstract model of performance and energy consumption that considers the characteristics of tasks as well as the computational resources in YARN. Based on this model, we study the energy efficiency of streaming tasks, which is characterized by the performance model and the energy consumption model of a task. We propose two key principles for improving energy efficiency: 1) CPU-usage-aware task allocation, which partitions tasks across NodeManagers (NMs) based on each task's CPU usage characteristics; and 2) resource-efficient task allocation, which reduces idle resources. Then, we propose a D-based binning algorithm for batch task scheduling and a K-based binning algorithm for online task scheduling that can adapt to continuously arriving tasks. We conduct extensive experiments on a real Hadoop 2.0 cluster and use two kinds of workloads to evaluate the performance and energy efficiency of our proposal. Compared with Storm (the streaming data processing tool in Hadoop 2.0) and other approaches, including TAPA and DVFS-MR, our proposal is more energy efficient. The batch task scheduling algorithm reduces energy consumption by up to 10 percent while keeping comparable performance. In addition, the online task scheduling algorithm reduces energy consumption by up to 7 percent compared with the existing algorithms.
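
The two allocation principles in the abstract (group tasks by CPU usage, and leave as little idle capacity as possible) can be illustrated with a generic bin-packing sketch. This is not the paper's D-based or K-based binning algorithm, whose details the abstract does not give; it is a plain first-fit-decreasing heuristic. The `Node` class, the `pack_first_fit_decreasing` function, and the idea of expressing CPU demand as an integer percentage of one node are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a NodeManager; capacity is CPU percent."""
    capacity: int = 100
    tasks: list = field(default_factory=list)

    @property
    def used(self) -> int:
        # Total CPU percent currently assigned to this node.
        return sum(cpu for _, cpu in self.tasks)


def pack_first_fit_decreasing(tasks, capacity=100):
    """Place (name, cpu_percent) tasks on as few nodes as the heuristic finds.

    Tasks are considered from the largest CPU demand to the smallest, and each
    one goes to the first node that still has room. Fewer active nodes means
    less idle capacity, which is the intuition behind resource-efficient
    allocation (an illustrative assumption, not the paper's method).
    """
    nodes = []
    for name, cpu in sorted(tasks, key=lambda t: t[1], reverse=True):
        if cpu > capacity:
            raise ValueError(f"task {name!r} needs more CPU than one node offers")
        for node in nodes:
            if node.used + cpu <= node.capacity:
                node.tasks.append((name, cpu))
                break
        else:
            # No existing node has room, so bring one more node into use.
            node = Node(capacity=capacity)
            node.tasks.append((name, cpu))
            nodes.append(node)
    return nodes


if __name__ == "__main__":
    batch = [("a", 60), ("b", 50), ("c", 40), ("d", 30), ("e", 20)]
    for i, node in enumerate(pack_first_fit_decreasing(batch)):
        print(i, node.tasks, f"{node.used}% used")
```

Sorting the whole task set first only works when all jobs are known up front, which matches the batch case. For online arrivals, the same first-fit loop could be run on each task as it arrives, without the sort, at the cost of a looser packing.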

Runtime-Aware Adaptive Scheduling in Stream Processing

Adaptive task scheduling in storm

An Efficient Fault-Tolerant Scheduling Algorithm for Periodic Real-Time Tasks in Heterogeneous Platforms

Communication-Efficient Task Scheduling for Real-Time Distributed Computing.

An On-the-Fly Scheduling Strategy for Distributed Stream Processing Platform.

Stromax: Partitioning-Based Scheduler For Real-Time Stream Processing System

An Adaptive Online Scheme for Scheduling and Resource Enforcement in Storm.

The Real-Time Scheduling Strategy Based on Traffic and Load Balancing in Storm

Adaptive Scheduling Framework of Streaming Applications based on Resource Demand Prediction with Hybrid Algorithms

Elastic Allocator: An Adaptive Task Scheduler for Streaming Query in the Cloud

Topology-Aware Task Allocation For Distributed Stream Processing With Latency Guarantee

A Cost-Efficient Scheduling Algorithm for Streaming Processing Applications on Cloud

Adaptive Scheduling for Efficient Execution of Dynamic Stream Workflows

A Task Scheduling Approach for Real-Time Stream Processing

Radar: Reducing Tail Latencies For Batched Stream Processing With Blank Scheduling

Energy-Efficient Task Scheduling for CPU-Intensive Streaming Jobs on Hadoop.

A state lossless scheduling strategy in distributed stream computing systems

Topology-aware task allocation for online distributed stream processing applications with latency constraints

Scheduling Storms and Streams in the Cloud

Adaptive Scheduling Parallel Jobs with Dynamic Batching in Spark Streaming

An Efficient Scheduling Algorithm for Stream Computing.