HGR: A Hybrid Global Graph-Based Recovery Approach for Cloud Storage Systems with Failure and Straggler Nodes

Piao Hui, Huangzhen Xue, Chentao Wu, Minyi Guo, Jie Li, Xiangyu Chen, Shaoteng Liu, Liyang Zhou, Shenghong Xie
DOI: https://doi.org/10.1109/icdcs60910.2024.00075
2024-01-01
Abstract: Cloud storage systems often face both failure nodes and straggler nodes. A failure is a fail-stop event, such as a disk failure, that can cause significant data unavailability. Straggler nodes are typically those with heavy workloads or poor performance. The two usually coexist, posing a significant challenge to data availability in storage systems. In such scenarios, parallel recovery and straggler recovery methods are commonly used as separate approaches for data recovery. However, parallel recovery methods encounter bottlenecks on the recovery path when straggler nodes are present, while straggler recovery methods lack available recovery paths when multiple nodes fail. Scenarios involving both multiple failures and stragglers are common, yet efficient recovery methods for them are lacking. In this paper, we address these issues for video data, which occupies a significant portion of cloud storage systems. We propose a Hybrid Global Graph-based Recovery (HGR) method that integrates parallel and straggler recovery approaches into a single global graph. The key idea of HGR is to construct a global graph that includes global node parameter information, enabling comprehensive coordination. We partition the global graph into two subgraphs: one containing straggler nodes and the other containing failure nodes. Resources are allocated to each subgraph so that recovery tasks can be scheduled in parallel. For data that presents significant recovery challenges, exhibits poor parallelism, has substantial tail latency, or exceeds fault tolerance limits, we employ approximate recovery methods. To demonstrate HGR's effectiveness, we conducted several experiments. The results indicate that HGR can reduce recovery time by up to 45.06% and improve I/O throughput by as much as 1.79x compared to state-of-the-art recovery methods.