NetEC: Accelerating Erasure Coding Reconstruction with In-Network Aggregation

Yi Qiao, Menghao Zhang, Yu Zhou, Xiao Kong, Han Zhang, Jun Bi, Mingwei Xu, Jilong Wang
DOI: https://doi.org/10.1109/tpds.2022.3145836
IF: 5.3
2022-01-01
IEEE Transactions on Parallel and Distributed Systems
Abstract: In distributed storage systems, Erasure Coding (EC) is a crucial technology to enable high data availability. By downloading parity data from surviving machines, EC can reconstruct lost data with much lower storage overhead than data replication. However, this reduction in storage cost comes at the expense of extra performance problems: low reconstruction rate, high degraded read latency, and high host CPU utilization. Our analysis shows that these performance problems are deeply rooted in host-based EC processing. To resolve these problems, we present NetEC, an in-network acceleration framework that fully offloads EC to new-generation programmable switching ASICs. We propose Explicit Buffer Size Notification (EBSN) to constrain decoding buffer usage, and design an on-switch one-to-many TCP proxy to integrate EBSN with TCP. We also design two parallel Galois Field (GF) offloading methods (table lookup and bitmatrix) to maximize parsable bytes. We implement NetEC on programmable switches and integrate it with HDFS. Extensive evaluations show that NetEC improves the reconstruction rate by 2.7x-6.8x, reduces the degraded read latency significantly, and removes the host CPU overhead completely. We also emulate multi-rack scenarios and show that NetEC is able to support ~GB/s reconstruction rates and tens of concurrent tasks.
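The abstract names table lookup as one way to perform Galois Field arithmetic during reconstruction. As a rough illustration of the general technique only (not NetEC's on-switch implementation, which the abstract does not detail), the following host-side Python sketch multiplies in GF(2^8) using log/exp tables and rebuilds a lost block as a linear combination of surviving blocks. The primitive polynomial 0x11D and all function names are assumptions chosen for this example.

```python
# Hypothetical sketch: GF(2^8) multiplication by table lookup, and
# reconstruction of a lost block as a linear combination of survivors.
# PRIM_POLY and every name here are illustrative assumptions.
PRIM_POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1, a common choice for RS codes


def build_tables():
    """Build exp/log tables for GF(2^8) with generator 2."""
    exp = [0] * 512
    log = [0] * 256
    x = 1
    for i in range(255):
        exp[i] = x
        log[x] = i
        x <<= 1
        if x & 0x100:
            x ^= PRIM_POLY
    # Duplicate the table so log[a] + log[b] never needs a modulo.
    for i in range(255, 512):
        exp[i] = exp[i - 255]
    return exp, log


GF_EXP, GF_LOG = build_tables()


def gf_mul(a, b):
    """Multiply two field elements with two log lookups and one exp lookup."""
    if a == 0 or b == 0:
        return 0
    return GF_EXP[GF_LOG[a] + GF_LOG[b]]


def gf_inv(a):
    """Multiplicative inverse of a nonzero field element."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^8)")
    return GF_EXP[255 - GF_LOG[a]]


def gf_mul_slow(a, b):
    """Shift-and-XOR reference multiplication, used to check the tables."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return r


def linear_combine(coeffs, blocks):
    """Bytewise sum over GF(2^8) of coeff * block (addition is XOR)."""
    out = bytearray(len(blocks[0]))
    for c, blk in zip(coeffs, blocks):
        for i, byte in enumerate(blk):
            out[i] ^= gf_mul(c, byte)
    return bytes(out)


# Toy code with two data blocks and one parity block: p = 3*d1 + 7*d2.
d1, d2 = b"hello", b"world"
parity = linear_combine([3, 7], [d1, d2])

# Suppose d1 is lost: d1 = inv(3) * p + inv(3) * 7 * d2.
k = gf_inv(3)
recovered = linear_combine([k, gf_mul(k, 7)], [parity, d2])
```

Because addition in GF(2^8) is XOR, the partial products from each surviving block can be accumulated in any order, which is the property that makes aggregating them along the network path possible in principle.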