Abstract: Big data processing applications have gradually been migrated to the cloud because of the advantages of cloud computing. The Hadoop Distributed File System (HDFS) is one of the fundamental support systems for big data processing on MapReduce-like frameworks such as Hadoop and Spark. However, the default block allocation scheme of HDFS does not fit cloud environments well and falls short in two respects: loss of data reliability and performance degradation. Both stem from the fact that HDFS in the cloud is not aware of the co-location of virtual machines. As a result, multiple replicas of the same file block may be allocated to the same physical machine, albeit in different virtual machines, which harms data reliability. It also leads to excessive remote task executions, which degrade performance. In this paper, we propose a novel location-aware data block allocation strategy to solve these problems. The strategy allocates data blocks according to the locations and the differing processing capacities of virtual nodes in the cloud. We implemented the strategy in an actual Hadoop cluster and evaluated its performance with the BigDataBench benchmark suite. The experimental results show that the strategy guarantees the designed data reliability while reducing the task execution time of Hadoop applications by 8.9% on average and by up to 11.2% compared with the original Hadoop in the cloud. Since data block allocation is a fundamental function of HDFS, we believe the proposed strategy can also benefit Spark and other HDFS-based applications.
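To make the co-location problem concrete, the following is a minimal illustrative sketch, not the paper's algorithm. It assumes the mapping from virtual machines to physical hosts is known and that each virtual node has a relative capacity score. The names `VirtualNode`, `physical_host`, `capacity`, and `place_replicas` are all hypothetical. The sketch places the replicas of one block so that no two share a physical host, preferring nodes that hold few blocks relative to their capacity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualNode:
    name: str
    physical_host: str  # assumption: the VM-to-host mapping is available
    capacity: float     # assumption: relative processing capacity, > 0


def place_replicas(nodes, load, replication=3):
    """Choose `replication` virtual nodes on distinct physical hosts.

    `load` maps a node name to the number of blocks it already holds and
    is updated in place. Nodes with a lower load-to-capacity ratio are
    preferred; ties are broken by name so the result is deterministic.
    """
    ranked = sorted(
        nodes,
        key=lambda n: (load.get(n.name, 0) / n.capacity, n.name),
    )
    chosen = []
    used_hosts = set()
    for node in ranked:
        if node.physical_host in used_hosts:
            continue  # a replica already lives on this physical machine
        chosen.append(node)
        used_hosts.add(node.physical_host)
        if len(chosen) == replication:
            break
    if len(chosen) < replication:
        raise ValueError("not enough distinct physical hosts for the replicas")
    for node in chosen:
        load[node.name] = load.get(node.name, 0) + 1
    return chosen
```

With two virtual machines on one host, at most one of them receives a replica of a given block, so a single physical failure cannot remove more than one copy. The capacity weighting is a simple stand-in for the "different processing capacities" mentioned in the abstract.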

Location-Aware Data Block Allocation Strategy for HDFS-Based Applications in the Cloud

Qos-Aware Indiscriminate Volume Storage Cloud

Location-Aware MapReduce in Virtual Cloud

QoSC: A QoS-Aware Storage Cloud Based on HDFS

A Load Balancing Strategy Based on Data Correlation in Cloud Computing

Fuzzy Clustering with Feature Weight Preferences for Load Balancing in Cloud

AI-oriented Workload Allocation for Cloud-Edge Computing.

A Policy of Task Allocation Base on Distributed Cluster Computing Towards Cloud

A distributed storage method of remote sensing data based on image blocks organization

The performance of MapReduce: an in-depth study

The Performance of MapReduce

A New Block-Based Data Distribution Mechanism in Cloud Computing

Moving Hadoop into the Cloud with Flexible Slot Management and Speculative Execution

QoS-Aware Data Placement for MapReduce Applications in Geo-Distributed Data Centers

Efficient Spatial Big Data Storage and Query in HBase.

Optimizing Hadoop Block Placement Policy and Cluster Blocks Distribution

Vhadoop: A Scalable Hadoop Virtual Cluster Platform for MapReduce-Based Parallel Machine Learning with Performance Consideration

Adaptive priority-based data placement and multi-task scheduling in geo-distributed cloud systems

Analysis of Big Data Platform with OpenStack and Hadoop.

A Resource Co-Allocation Method for Load-Balance Scheduling over Big Data Platforms

Weight-based strategy for an I/O-intensive application at a cloud data center.