Abstract: Cloud storage systems frequently contend with both failed nodes and straggler nodes. Failures follow a fail-stop model, such as disk failures that can make large amounts of data unavailable. Straggler nodes are typically those under heavy workloads or with poor performance. The two usually coexist, posing a significant challenge to data availability in storage systems. In such scenarios, parallel recovery and straggler recovery methods are commonly applied as separate approaches to data recovery. However, parallel recovery methods hit bottlenecks on the recovery path when straggler nodes are present, while straggler recovery methods run short of available recovery paths when multiple nodes fail. Scenarios involving both multiple failures and stragglers are common, yet efficient recovery methods for them are lacking. In this paper, we address these issues with a focus on video data, which occupies a significant portion of cloud storage systems. We propose a Hybrid Global Graph-based Recovery (HGR) method that integrates parallel and straggler recovery approaches into a single global graph. The key idea of HGR is to construct a global graph that carries parameter information for all nodes, enabling comprehensive coordination. We partition the global graph into two subgraphs: one containing straggler nodes and the other containing failure nodes. Resources are allocated to each subgraph so that recovery tasks can be scheduled in parallel. For data that is hard to recover, parallelizes poorly, incurs substantial tail latency, or exceeds fault tolerance limits, we employ approximate recovery methods. Our experiments indicate that HGR can reduce recovery time by up to 45.06% and improve I/O throughput by as much as 1.79x compared to state-of-the-art recovery methods.
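The partitioning and fallback logic described in the abstract can be illustrated with a minimal sketch. It is not the HGR algorithm itself, which the abstract does not specify in detail. It assumes an erasure-coded stripe that tolerates at most `parity_count` lost chunks, and every name in it (`Node`, `State`, `load`, `partition_nodes`, `plan_stripe`) is hypothetical. The sketch splits nodes into failure, straggler, and healthy groups, picks the least-loaded helpers for exact recovery, and falls back to approximate recovery when the number of failures exceeds the fault tolerance limit.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    HEALTHY = "healthy"
    STRAGGLER = "straggler"
    FAILED = "failed"


@dataclass(frozen=True)
class Node:
    name: str
    state: State
    load: float  # hypothetical load metric; higher means slower


def partition_nodes(nodes):
    """Split nodes into (failed, stragglers, healthy) groups."""
    failed = [n for n in nodes if n.state is State.FAILED]
    stragglers = [n for n in nodes if n.state is State.STRAGGLER]
    healthy = [n for n in nodes if n.state is State.HEALTHY]
    return failed, stragglers, healthy


def plan_stripe(stripe_nodes, parity_count):
    """Return (mode, helper_names) for one stripe.

    Assumption: any k = len(stripe_nodes) - parity_count surviving
    chunks suffice for exact decoding (an MDS-style code).
    """
    failed, stragglers, healthy = partition_nodes(stripe_nodes)
    if not failed and not stragglers:
        return "none", []
    if len(failed) > parity_count:
        # Beyond the fault tolerance limit: exact decoding is impossible.
        return "approximate", []
    k = len(stripe_nodes) - parity_count
    # Prefer healthy helpers, least loaded first.
    helpers = sorted(healthy, key=lambda n: n.load)[:k]
    if len(helpers) < k:
        # Not enough healthy nodes: fill up with the least-loaded stragglers.
        missing = k - len(helpers)
        helpers += sorted(stragglers, key=lambda n: n.load)[:missing]
    return "exact", [n.name for n in helpers]


stripe = [
    Node("a", State.HEALTHY, 0.2),
    Node("b", State.HEALTHY, 0.1),
    Node("c", State.STRAGGLER, 0.9),
    Node("d", State.FAILED, 0.0),
    Node("e", State.HEALTHY, 0.3),
    Node("f", State.HEALTHY, 0.5),
]
mode, helpers = plan_stripe(stripe, parity_count=2)
print(mode, helpers)
```

In this toy stripe the straggler `c` is avoided because four healthy helpers are available. A real scheduler would also have to weigh bandwidth, placement, and cross-stripe contention, which this sketch ignores.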

HGR: A Hybrid Global Graph-Based Recovery Approach for Cloud Storage Systems with Failure and Straggler Nodes
