Abstract: Repairing multiple failures in distributed storage systems poses a challenge for erasure coding: how to minimize repair time while incurring the least extra repair network traffic. Existing repair schemes designed for a single failure incur high network traffic because they repair multiple failures serially, while repair schemes designed for multiple failures suffer from long repair times because of their centralized repair structure. In this paper, we propose a decentralized adaptive repair scheme, called DARS, to minimize repair time with the least extra network traffic cost. Specifically, we propose a three‐layer repair model that supports the repair of both single and multiple failures. To achieve low repair time, a bandwidth‐aware node selection technique guides the selection of nodes, and a line‐structured data transmission technique organizes the data transmission between the providers and the newcomer. To keep extra network traffic to a minimum, a core‐based data distribution technique organizes the data transmission between the coordinator and the other newcomers, and an intersection provider adjustment technique adaptively adjusts the number of intersection providers. Moreover, we adopt ‘lazy repair’ within a stripe to further reduce the repair network traffic cost. We implement and evaluate DARS on our raid distributed storage system under various parameter settings with 30 physical machines and 200 virtual machines. Experimental results confirm that DARS reduces the repair time by 29% and 55% on average compared with tree‐structured repair and CORE, respectively. Copyright © 2015 John Wiley & Sons, Ltd.

Greedy Transfer Planning Search for Improving Repair Throughput of RDP-like Coded Storage Clusters

Repairing Multiple Failures Adaptively with Erasure Codes in Distributed Storage Systems

Concurrent Node Reconstruction for Erasure-Coded Storage Clusters

Multi-level Forwarding and Scheduling Recovery Algorithm in Rapidly-changing Network for Erasure-coded Clusters

A Comprehensive Repair Scheme for Distributed Storage Systems

RCS: A Redirection Computational Scheduler to Accelerate Straggler Recovery for Erasure Coded Cloud Storage System

Toward Optimal Repair and Load Balance in Locally Repairable Codes

Cooperative Repair Based on Tree Structure for Multiple Failures in Distributed Storage Systems with Regenerating Codes

Optimal Data Placement for Stripe Merging in Locally Repairable Codes

Optimal Repair Layering for Erasure-Coded Data Centers: from Theory to Practice

An Efficient I/O-Redirection-Based Reconstruction Scheme for Erasure-Coded Storage Clusters

CoRec: A Cooperative Reconstruction Pattern for Multiple Failures in Erasure-Coded Storage Clusters

Accelerating erasure coding by exploiting multiple repair paths in distributed storage systems

Cost Optimal Regenerating Codes Design for Satellite Clustered Distributed Storage System

Exploiting Decoding Computational Locality to Improve the I/O Performance of an XOR-Coded Storage Cluster under Concurrent Failures

A Data Layout and Fast Failure Recovery Scheme for Distributed Storage Systems with Mixed Erasure Codes

Global Repair Bandwidth Cost Optimization of Generalized Regenerating Codes in Clustered Distributed Storage Systems

On the Speedup of Recovery in Large-Scale Erasure-Coded Storage Systems

Multi-node Repair Based on GA_PSO with Fractional Regenerating Code Combined with Prior Replication

LaRS: A Load-Aware Recovery Scheme for Heterogeneous Erasure-Coded Storage Clusters

On the Optimal Provider Selection for Repair in Distributed Storage System with Network Coding