A Parallel Partial Merge Repair Algorithm for Multi-block Failures for Erasure Storage Systems

Shuaipeng Zhang, Shiyi Li, Chentao Wu, Ruobin Wu, Saiqin Long, Wen Xia
DOI: https://doi.org/10.1109/ipdps57955.2024.00060
2024-01-01
Abstract: To achieve high availability at low storage cost, distributed storage systems widely use erasure codes instead of replication. Compared to replication, erasure codes reduce storage costs but incur higher repair costs. Many existing repair algorithms reduce the block reconstruction time for a single block failure. However, applying these methods to multi-block failures may lead to unbalanced network traffic, unnecessary network transfers, and network congestion at the data collection node during the repair process, so the bandwidth between nodes cannot be fully used. To solve this problem, we propose a novel repair algorithm for multi-block failures called Partial Merge Repair (PMR), a scheduling algorithm that considers the network load between nodes and recovers multiple failed blocks together. It first divides all surviving nodes into groups; the data collection node within each group then collects the data needed to repair multiple blocks through cross merging. Finally, each data collection node sends the collected blocks to the repair node to complete the repair. Our study presents a formal definition and proof of the network transfer time in the modeled repair process of PMR, highlighting its superior efficiency compared to existing methods in homogeneous environments. We implement a prototype of PMR to evaluate its performance. The experimental results indicate that, compared to existing repair techniques, PMR improves repair throughput by 28% to 256% across various scenarios.
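The grouping and merging steps described in the abstract rely on the linearity of erasure-code repair: a failed block is a linear combination of surviving blocks, so each group can merge its share locally and forward only one partial result per failed block. The Python sketch below illustrates that idea only; it is not the authors' PMR scheduler and does not model cross merging or network load. The function names, the group assignment, the GF(2^8) polynomial 0x11d, and the repair coefficients are all assumptions chosen for illustration.

```python
from functools import reduce


def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8), assuming the polynomial 0x11d."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D  # reduce modulo the field polynomial
        b >>= 1
    return result


def scale(block: bytes, coeff: int) -> bytes:
    """Multiply every byte of a block by one coefficient."""
    return bytes(gf_mul(x, coeff) for x in block)


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Add two equal-length blocks in GF(2^8), which is bytewise XOR."""
    return bytes(x ^ y for x, y in zip(a, b))


def partial_merge(blocks: dict, coeffs: dict) -> bytes:
    """Merge the given blocks into one block: the sum of coeff * block."""
    size = len(next(iter(blocks.values())))
    acc = bytes(size)
    for node, block in blocks.items():
        acc = xor_blocks(acc, scale(block, coeffs[node]))
    return acc


def grouped_repair(blocks: dict, groups: list, coeff_table: dict) -> dict:
    """Repair several failed blocks via per-group partial results.

    blocks:      surviving node id -> block bytes
    groups:      hypothetical partition of the surviving node ids
    coeff_table: failed block id -> {node id: repair coefficient}
    """
    repaired = {}
    for failed_id, coeffs in coeff_table.items():
        # Each group's collector merges only the blocks held in its group.
        partials = [
            partial_merge({n: blocks[n] for n in group}, coeffs)
            for group in groups
        ]
        # The repair node combines one partial result per group.
        repaired[failed_id] = reduce(xor_blocks, partials)
    return repaired
```

Under these assumptions, the repair node receives one block per group for each failed block rather than one block per surviving node, and the result is identical to merging all surviving blocks at a single collection node.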