Location-Aware Data Block Allocation Strategy for HDFS-Based Applications in the Cloud

Hua Xu, Weiqing Liu, Guansheng Shu, Jing Li
DOI: https://doi.org/10.1109/cloud.2016.0042
2016-01-01
Abstract: Big data processing applications have gradually migrated to the cloud because of the advantages of cloud computing. The Hadoop Distributed File System (HDFS) is one of the fundamental support systems for big data processing on MapReduce-like frameworks such as Hadoop and Spark. However, the default block allocation scheme of HDFS does not fit cloud environments well, in two respects: loss of data reliability and performance degradation. Both arise because HDFS in the cloud is not aware of the co-location of virtual machines. As a result, multiple replicas of the same file block may be allocated to the same physical machine, albeit in different virtual machines, which harms data reliability. It also leads to excessive remote task executions, which degrade performance. In this paper, we propose a novel location-aware data block allocation strategy to solve these problems. The strategy allocates data blocks according to the locations and the differing processing capacities of virtual nodes in the cloud. We implemented our strategy in an actual Hadoop cluster and evaluated its performance with the benchmark suite BigDataBench. The experimental results show that our strategy can guarantee the designed data reliability while reducing the task execution time of Hadoop applications by 8.9% on average and by up to 11.2% compared with the original Hadoop in the cloud. Since data block allocation is a fundamental function of HDFS, we believe the proposed strategy can also benefit Spark and other HDFS-based applications.
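The abstract does not give the allocation algorithm itself, so the following is only a minimal sketch of the general idea it describes: choose replica targets so that no two replicas of a block share a physical machine, and favor virtual nodes with more processing capacity. The function `place_replicas`, the `vm_to_host` mapping, and the `vm_capacity` weights are hypothetical names introduced for illustration; they are not taken from the paper or from the HDFS API.

```python
import random
from collections import defaultdict


def place_replicas(vm_to_host, vm_capacity, replication=3, rng=None):
    """Pick `replication` VMs for one block, at most one per physical host.

    vm_to_host:  dict of VM name -> physical host name (assumed to be known)
    vm_capacity: dict of VM name -> positive capacity weight (assumed metric)
    """
    rng = rng or random.Random()

    # Group virtual nodes by the physical machine they run on.
    candidates = defaultdict(list)
    for vm, host in vm_to_host.items():
        candidates[host].append(vm)

    if replication > len(candidates):
        raise ValueError("fewer physical hosts than requested replicas")

    chosen = []
    for _ in range(replication):
        # Weight each remaining host by the total capacity of its VMs.
        hosts = list(candidates)
        host_weights = [sum(vm_capacity[v] for v in candidates[h]) for h in hosts]
        host = rng.choices(hosts, weights=host_weights, k=1)[0]

        # Remove the host so later replicas cannot land on the same machine,
        # then pick one of its VMs in proportion to capacity.
        vms = candidates.pop(host)
        vm = rng.choices(vms, weights=[vm_capacity[v] for v in vms], k=1)[0]
        chosen.append(vm)
    return chosen


if __name__ == "__main__":
    # Illustrative topology only: six VMs spread over three physical hosts.
    topology = {"vm1": "hostA", "vm2": "hostA", "vm3": "hostB",
                "vm4": "hostB", "vm5": "hostC", "vm6": "hostC"}
    capacity = {"vm1": 1.0, "vm2": 2.0, "vm3": 1.0,
                "vm4": 1.0, "vm5": 4.0, "vm6": 1.0}
    print(place_replicas(topology, capacity, replication=3))
```

Removing a host from the candidate set once it holds a replica enforces the reliability constraint, while the capacity-proportional choice is one simple way to send more blocks, and hence more local tasks, to faster nodes. The strategy evaluated in the paper may weigh these factors differently.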