Accelerating erasure coding by exploiting multiple repair paths in distributed storage systems

Chanki Kim, Kang-Wook Chon
DOI: https://doi.org/10.1007/s10586-024-04438-y
2024-04-14
Cluster Computing
Abstract: High reliability must be ensured in distributed storage systems (DSSs) to maintain the stability of warehouse-scale computing and high-performance computing (HPC) systems. For system-level reliability, a repair operation using redundant storage nodes can be used in conjunction with erasure coding (EC), which can also affect system performance. Existing EC designs have mainly focused on minimizing the bandwidth required for repair and the storage overhead. However, the computing performance of EC should also be considered in order to achieve high bandwidth and exploit the capacity of back-end network links with heterogeneous and high-speed interconnects over 10 Gbps Ethernet. In this study, a new computing acceleration method for the repair operation in EC is proposed that uses multiple repair paths and modifies the computation kernel on the graphics processing unit (GPU) device. For Cauchy Reed–Solomon (CRS) codes, the proposed scheme is observed to achieve a repair bandwidth that is sufficient compared to the theoretical bound or that exceeds the current maximum Ethernet link bandwidth.
computer science, information systems, theory & methods