Abstract: Because of their batch processing capability and distributed storage, MapReduce and HDFS have always been the core components of Hadoop. Many studies still focus on improving and optimizing MapReduce task scheduling algorithms. However, these algorithms do not perform well for real-time processing. In this paper, we design a real-time processing framework based on Hadoop that has practical value in the field of real-time processing, and we put forward real-time scheduling algorithms for the file storage layer and the computing layer of this framework. Our motivation is to provide real-time processing algorithms for Hadoop that address big data analysis problems. For the file storage layer, we propose a data management strategy named HDMA. The algorithm considers a variety of factors, such as load balancing, data locality, and the differing properties of each machine in a heterogeneous cluster. HDMA can significantly improve the computing performance of the whole cluster. For the computing layer, we propose a dynamic resource allocation scheduling algorithm based on job length, named LERDA. LERDA divides jobs into several levels according to the length of each job and thus prevents short jobs from waiting for a long time. Experiments show that (a) our algorithms increase data locality by 15% compared to other algorithms, (b) our algorithms improve the real-time performance of the map phase by approximately 20% in the best case, and (c) LERDA prevents short jobs from suffering starvation.
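To make the idea of grouping jobs into levels by length more concrete, the following minimal Python sketch places jobs in one queue per length level and always serves the shortest non-empty level first. It is not the LERDA algorithm itself, whose details the abstract does not give. The level boundaries in `LEVEL_BOUNDS`, the `job_level` function, the `LengthLevelQueues` class, and the job names are all hypothetical and chosen only for illustration.

```python
from collections import deque

# Hypothetical level boundaries in estimated seconds; the abstract gives none.
LEVEL_BOUNDS = [60, 600]  # level 0: < 60, level 1: < 600, level 2: the rest


def job_level(estimated_length):
    """Return the index of the first level whose bound exceeds the length."""
    for level, bound in enumerate(LEVEL_BOUNDS):
        if estimated_length < bound:
            return level
    return len(LEVEL_BOUNDS)


class LengthLevelQueues:
    """One FIFO queue per length level (an illustrative structure only)."""

    def __init__(self):
        self.queues = [deque() for _ in range(len(LEVEL_BOUNDS) + 1)]

    def submit(self, name, estimated_length):
        # Place the job in the queue that matches its estimated length.
        self.queues[job_level(estimated_length)].append(name)

    def next_job(self):
        """Take the oldest job from the shortest non-empty level."""
        for queue in self.queues:
            if queue:
                return queue.popleft()
        return None


queues = LengthLevelQueues()
queues.submit("long-etl", 3600)
queues.submit("short-query", 20)
queues.submit("medium-report", 300)
order = [queues.next_job() for _ in range(3)]
print(order)
```

In this sketch the short job is dispatched first even though it was submitted after the long one. Strict priority of this kind could in turn delay long jobs indefinitely, so a practical scheduler would presumably also need to share resources across levels, which is where a dynamic allocation step would come in.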

A Scheduling Strategy Based on Multi-Queues of Cassandra

A Request Skew Aware Heterogeneous Distributed Storage System Based on Cassandra

Schema-Driven Performance Evaluation for Highly Concurrent Scenarios

Performance-Driven Task and Data Co-scheduling Algorithms for Data-Intensive Applications in Grid Computing

A Load Balancing Strategy Based on Data Correlation in Cloud Computing

Optimization of Big Data Parallel Scheduling Based on Dynamic Clustering Scheduling Algorithm

Hadoop Scheduling Base On Data Locality

MQWAGS: Research on Job Scheduling Algorithms Based on Cloud Computing

QoS-aware and multi-objective virtual machine dynamic scheduling for big data centers in clouds

Priority Task Scheduling Strategy for Heterogeneous Multi-Datacenters in Cloud Computing

A User-Level NUMA-Aware Scheduler for Optimizing Virtual Machine Performance

Distributed Cache Memory Data Migration Strategy Based on Cloud Computing

Adaptive Scheduling Strategy For Data Stream Management System

Adaptive priority-based data placement and multi-task scheduling in geo-distributed cloud systems

A Near Optimal Multi-Faced Job Scheduler For Datacenter Workloads

Low Complexity Hierarchical Scheduling for Diverse Datacenter Jobs

Design of A More Scalable Database System

A hunger-based scheduling strategy for distributed crawler

Heterogeneous Replicas for Multi-dimensional Data Management

A Real-Time Scheduling Strategy Based on Processing Framework of Hadoop

Location-Aware Data Block Allocation Strategy for HDFS-Based Applications in the Cloud