Towards a Better Replica Management for Hadoop Distributed File System.

Hilmi Egemen Ciritoglu,Takfarinas Saber,Teodora Sandra Buda,John Murphy,Christina Thorpe
DOI: https://doi.org/10.1109/bigdatacongress.2018.00021
2018-01-01
Abstract:The Hadoop Distributed File System (HDFS) is the storage of choice when it comes to large-scale distributed systems. In addition to being efficient and scalable, HDFS provides high throughput and reliability through the replication of data. Recent work exploits this replication feature by dynamically varying the replication factor of in-demand data as a means of increasing data locality and achieving a performance improvement. However, to the best of our knowledge, no study has been performed on the consequences of varying the replication factor. In particular, our work is the first to show that although HDFS deals well with increasing the replication factor, it experiences problems with decreasing it. This leads to unbalanced data, hot spots, and performance degradation. In order to address this problem, we propose a new workload-aware balanced replica deletion algorithm. We also show that our algorithm successfully maintains the data balance and achieves up to 48% improvement in execution time when compared to HDFS, while only creating an overhead of 1.69% on average.
What problem does this paper attempt to address?