Deterministic Data Distribution for Efficient Recovery in Erasure-Coded Storage Systems

Liangliang Xu, Min Lyu, Zhipeng Li, Yongkun Li, Yinlong Xu
DOI: https://doi.org/10.1109/tpds.2020.2987837
IF: 5.3
2020-10-01
IEEE Transactions on Parallel and Distributed Systems
Abstract: Because individual commodity components are unreliable, failures are common in large-scale distributed storage systems. Erasure codes are widely deployed in practical storage systems to provide fault tolerance with low storage overhead. However, random data distribution (RDD), commonly used in erasure-coded storage systems, induces heavy cross-rack traffic, load imbalance, and random access, all of which adversely affect failure recovery. In this article, with orthogonal arrays, we define a Deterministic Data Distribution ($D^3$) to uniformly distribute data/parity blocks among nodes, and propose an efficient failure recovery approach based on $D^3$, which minimizes the cross-rack repair traffic against a single node failure. Thanks to the uniformity of $D^3$, the proposed recovery approach balances the repair traffic not only among nodes within a rack but also among racks. We implement $D^3$ over Reed-Solomon (RS) codes and Locally Repairable Codes (LRCs) in Hadoop Distributed File System (HDFS) with a cluster of 28 machines. Compared with RDD, our experiments show that $D^3$ significantly speeds up failure recovery, by up to 2.49 times for RS codes and 1.38 times for LRCs. Moreover, $D^3$ supports front-end applications better than RDD in both normal and recovery states.
computer science, theory & methods; engineering, electrical & electronic