Bandwidth-aware Delayed Repair in Distributed Storage Systems

Jiajie Shen, Jiazhen Gu, Yangfan Zhou, Xin Wang
DOI: https://doi.org/10.1109/iwqos.2016.7590386
2016-01-01
Abstract: In data storage systems, data are typically stored redundantly across storage nodes to ensure reliability. When storage nodes fail, the lost data can be restored onto new storage nodes with the help of the surviving redundant nodes. This regeneration process may itself be aborted if further storage nodes fail while it is running, so reducing regeneration time is a well-known challenge in improving the reliability of storage systems. Delayed repair is a typical repair scheme in real-world storage systems: it reduces the overhead of regeneration by recovering multiple node failures simultaneously. How to reduce the regeneration time of delayed repair has yet to be well addressed. Because available bandwidth fluctuates in storage systems and strongly affects regeneration time, we find that the key to solving this problem is determining when to start the regeneration process. By modeling this problem with the Lyapunov optimization framework, we propose the OMFR scheme to reduce regeneration time. Experimental results show that OMFR can reduce cumulative regeneration time by up to 78% compared with traditional delayed repair schemes.
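The abstract's central observation, that regeneration time depends on when the repair starts because available bandwidth varies over time, can be illustrated with a toy calculation. The sketch below is not the OMFR scheme and does not use Lyapunov optimization. The bandwidth trace, the data volume, and the function name `regeneration_time` are all hypothetical, chosen only to show the effect, and the model ignores the reliability cost of waiting.

```python
from typing import Optional, Sequence


def regeneration_time(bandwidth: Sequence[float], data: float, start: int) -> Optional[int]:
    """Number of time slots needed to transfer `data` units if repair begins at slot `start`.

    `bandwidth[t]` is the (assumed) amount of data that can be moved in slot t.
    Returns None if the trace ends before the transfer completes.
    """
    remaining = data
    for slot in range(start, len(bandwidth)):
        remaining -= bandwidth[slot]
        if remaining <= 0:
            return slot - start + 1
    return None


# Hypothetical trace: low bandwidth, then a burst, then low again.
trace = [1, 1, 1, 4, 4, 4, 1, 1]
data = 8  # hypothetical amount of data to regenerate

# Regeneration time for every candidate start slot.
times = {s: regeneration_time(trace, data, s) for s in range(len(trace))}

# Earliest start slot with the shortest regeneration time.
best_start = min((s for s in times if times[s] is not None), key=lambda s: times[s])

print(times)
print("best start slot:", best_start)
```

In this made-up trace, starting immediately takes longer than starting when the burst begins. A real scheme has to make that choice online, without knowing future bandwidth, and has to weigh the added exposure to further failures while waiting.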