Optimizing Erasure-Coded Data Archival for Replica-Based Storage Clusters

Jianzhong Huang, Panping Zhou, Xiao Qin, Yanqun Wang, Changsheng Xie
DOI: https://doi.org/10.1093/comjnl/bxy079
2018-01-01
The Computer Journal
Abstract: For the sake of cost-effectiveness, it is conventional wisdom to employ (k + r, k) erasure codes to archive rarely accessed replicas, i.e. erasure-coded data archival. Existing research on optimizing erasure-coded data archival mainly aims to reduce archival traffic within storage clusters. Apart from archival traffic, both non-sequential reads and imbalanced loads can degrade archival performance. Traditional distributed archival schemes (DArch for short) for randomly distributed replicas tend to suffer from two problems: (i) non-sequential reads, because underlying file systems split a data block into multiple smaller data chunks, and (ii) imbalanced loads, since archival tasks are assigned according to the data locality of replicas. To overcome these drawbacks, we incorporate both a prefetching mechanism and a balancing strategy into erasure-coded archival for replica-based storage clusters, and propose three new archival schemes: a prefetching-enabled archival scheme (i.e. P-DArch), a balancing-enabled archival scheme (i.e. B-DArch) and a prefetching-and-balancing-enabled archival scheme (i.e. PB-DArch). We implement a proof-of-concept prototype in which all four archival schemes are deployed and quantitatively evaluated. The experimental results show that both the prefetching mechanism and the balancing strategy can effectively improve the archival performance of a replica-based storage cluster exhibiting a random data layout. In a (12,9) RS-coded archival scenario, P-DArch, B-DArch and PB-DArch outperform DArch by factors of 2.95, 1.72 and 3.85, respectively.
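The cost-effectiveness argument for a (k + r, k) code can be illustrated with a small calculation. The sketch below is not taken from the paper: the function name `erasure_overhead` and the replication factor of 3 used for comparison are assumptions made only for illustration. It computes the storage overhead of the (12,9) code mentioned in the abstract, which stores 12 blocks for every 9 data blocks and tolerates the loss of any r = 3 of them.

```python
from fractions import Fraction


def erasure_overhead(n: int, k: int) -> Fraction:
    """Storage overhead of an (n, k) code: n blocks stored per k data blocks."""
    if not 0 < k <= n:
        raise ValueError("require 0 < k <= n")
    return Fraction(n, k)


# (12,9) RS code from the abstract: k = 9 data blocks, r = 3 parity blocks.
n, k = 12, 9
r = n - k

# Assumption for comparison only: the cluster keeps 3 replicas of each block.
assumed_replicas = 3

ec_overhead = erasure_overhead(n, k)
saving = 1 - ec_overhead / assumed_replicas

print(f"parity blocks r = {r}")
print(f"erasure-coded overhead = {float(ec_overhead):.3f}x")
print(f"replication overhead   = {assumed_replicas}x (assumed)")
print(f"raw storage saved by archiving = {float(saving):.1%}")
```

Under the assumed 3-way replication, archiving reduces the raw storage per data block from 3x to 4/3x, which is why rarely accessed replicas are natural candidates for erasure-coded archival.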