Greedy Transfer Planning Search for Improving Repair Throughput of RDP-like Coded Storage Clusters

Juehao Chen, Shiyi Li, Wen Xia, Shuaipeng Zhang, Qicong Lin, Haojun Hu
DOI: https://doi.org/10.1109/iwqos61813.2024.10682840
2024-01-01
Abstract: With the increasing scale of data and user demands for low latency, large-scale clusters have become a trend. To ensure high data availability in such clusters, XOR-based erasure codes are widely used for fault tolerance because of their low storage and computational overhead. Meanwhile, as clusters scale to hundreds or thousands of nodes, the probability of multiple node failures is no longer negligible. Such failures can lead to serious consequences, such as data loss, so the lost data should be recovered as soon as possible. However, codes such as RDP and EVENODD can easily cause network congestion when recovering from concurrent failures, which makes fast recovery challenging. To address this issue, we propose a novel network transfer plan search algorithm, Greedy Row-Diagonal Parity Search, or GRS for short. GRS allocates the network traffic generated during the repair process by greedily utilizing idle bandwidth and leveraging the commutative property of XOR operations, ensuring a more even distribution of traffic across the cluster network and thereby improving repair throughput. We build a prototype in a distributed erasure-coded cluster and conduct an experimental evaluation. The experimental results indicate that, compared to existing repair optimization methods, GRS improves repair throughput by 230%-880%.
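The XOR property that the abstract relies on can be illustrated with a minimal sketch. This is not the GRS algorithm and does not model RDP's diagonal parity. It assumes a single row-parity stripe, and the helper names (`xor_blocks`, `make_stripe`, `repair_chain`, `repair_tree`) are hypothetical. It shows only that, because XOR is commutative and associative, surviving blocks can be combined in any order and grouping, which is what leaves a transfer planner free to route partial results over idle links.

```python
from functools import reduce
import random


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_stripe(k: int, block_size: int, seed: int = 0):
    """Build k random data blocks plus one row-parity block (illustrative only)."""
    rng = random.Random(seed)
    data = [bytes(rng.randrange(256) for _ in range(block_size)) for _ in range(k)]
    parity = reduce(xor_blocks, data)
    return data, parity


def repair_chain(blocks):
    """Combine blocks one after another, as in a linear relay of partial results."""
    return reduce(xor_blocks, blocks)


def repair_tree(blocks):
    """Combine blocks pairwise, level by level, as in a tree of partial results."""
    level = list(blocks)
    while len(level) > 1:
        nxt = [xor_blocks(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # carry an unpaired block up to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]


if __name__ == "__main__":
    data, parity = make_stripe(k=4, block_size=16)
    lost = data[2]
    survivors = data[:2] + data[3:] + [parity]

    shuffled = survivors[:]
    random.Random(1).shuffle(shuffled)

    # Any order and any grouping of the survivors rebuilds the same block.
    print(repair_chain(survivors) == lost)
    print(repair_tree(shuffled) == lost)
```

Since every aggregation order yields the same repaired block, the choice of which node combines which partial results can be driven by available bandwidth alone. How GRS actually makes that choice is not specified in the abstract.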