Boosting Correlated Failure Repair in SSD Data Centers

Junmei Chen, Zongpeng Li, Qifu Tyler Sun, Ne Wang, Lina Su
DOI: https://doi.org/10.1109/jiot.2023.3339979
IF: 10.6
2024-01-01
IEEE Internet of Things Journal
Abstract: Current data centers rely on failure protection mechanisms to ensure data reliability. However, recent research indicates that failures within the same node or rack are common in data centers that use flash-based solid-state drives (SSDs) as the primary storage medium. Such correlated failures make it difficult for traditional protection mechanisms to achieve both high reliability and good repair performance. To this end, we propose a product erasure code (PECode) that cooperatively encodes the data blocks of multiple stripes to generate intra-stripe and inter-stripe parity blocks. We then design a multi-stripe cooperative repair algorithm (MSCRepair). MSCRepair first creates a failure distribution matrix (FDM) to represent the distribution of failed blocks across nodes and racks, and then conducts FDM-guided repair to minimize cross-rack traffic upon correlated failures. We prove that MSCRepair achieves the least cross-rack repair traffic at the cost of a longer repair time. We further propose a correlated failure repair scheduling algorithm for MSCRepair, which reduces the repair time by balancing the load and delivering data over links with higher bandwidth. We evaluate MSCRepair through both large-scale simulations and real experiments. Compared with state-of-the-art alternatives, MSCRepair reduces cross-rack traffic by 19.6% to 49.9% and the recovery time of correlated failures by 16.2% to 51.4%.
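The abstract does not specify how the FDM is laid out. As a rough illustration only, the sketch below assumes one row per rack and one column per node, with each entry counting the failed blocks on that node. The function name `build_fdm`, its parameters, and the sample failure list are hypothetical and are not taken from the paper.

```python
from collections import Counter

def build_fdm(failed_blocks, num_racks, nodes_per_rack):
    """Build an illustrative failure distribution matrix.

    failed_blocks: iterable of (rack, node) pairs, one per failed block.
    Returns a num_racks x nodes_per_rack list of lists in which entry
    [r][n] is the number of failed blocks on node n of rack r.
    This layout is an assumption, not the paper's definition.
    """
    counts = Counter(failed_blocks)
    return [
        [counts[(rack, node)] for node in range(nodes_per_rack)]
        for rack in range(num_racks)
    ]

# Hypothetical correlated failure: three failed blocks in rack 0
# (two of them on the same node) and one failed block in rack 2.
failed = [(0, 1), (0, 1), (0, 2), (2, 0)]
fdm = build_fdm(failed, num_racks=3, nodes_per_rack=3)

# Row sums show how many failed blocks each rack holds, which is the
# kind of information a rack-aware repair plan could start from.
per_rack = [sum(row) for row in fdm]
```

Under this assumed layout, a large row sum marks a rack with many co-located failures, where repairing inside the rack before sending data across racks would matter most.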