RCS: A Redirection Computational Scheduler to Accelerate Straggler Recovery for Erasure Coded Cloud Storage System
Xinzhe Cao, Yunfei Gu, Chentao Wu, Jie Li, Minyi Guo, Yuanyuan Dong, Yafei Zhao
DOI: https://doi.org/10.1109/iccd56317.2022.00104
2022-01-01
Abstract: The straggler problem is one of the most significant problems in cloud computing systems: a large number of parallel processes are blocked by a small set of straggler tasks with long waiting times. The problem is especially acute in erasure coded storage systems, where each recovery process must retrieve multiple chunks from different nodes. With skewed data accesses from various applications, a few heavily loaded nodes can easily become stragglers during recovery, leading to unacceptably long tail latency. To address this problem, we propose a Redirection Computational Scheduling method, called RCS, to accelerate data recovery under straggler scenarios. The key idea of RCS is to transfer the computational and network workload from one node to another, which avoids the adverse effects caused by stragglers. To demonstrate the effectiveness of RCS, we conduct several experiments in a cluster. The results show that, compared to state-of-the-art recovery methods, RCS reduces recovery time by up to 72.1% and improves recovery throughput by up to a factor of 1.4x.
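The redirection idea stated in the abstract can be illustrated with a minimal sketch. This is not the RCS algorithm itself, which the abstract does not specify. It only assumes that each node reports a load value and that a recovery task assigned to an overloaded node is moved to the least-loaded node. The function name `redirect_task`, the `loads` map, the node names, and the `straggler_threshold` parameter are all hypothetical.

```python
from typing import Dict


def redirect_task(loads: Dict[str, float], assigned: str,
                  straggler_threshold: float) -> str:
    """Return the node that should run a recovery task.

    Hypothetical policy (an assumption, not taken from the paper):
    keep the task on its assigned node unless that node's load reaches
    the threshold, in which case move it to the least-loaded node.
    """
    if loads[assigned] < straggler_threshold:
        return assigned  # assigned node is not a straggler; keep it
    # Pick the least-loaded node; break ties by name for determinism.
    return min(loads, key=lambda node: (loads[node], node))


# Made-up load values for illustration only.
loads = {"node-a": 0.95, "node-b": 0.30, "node-c": 0.55}
target = redirect_task(loads, "node-a", straggler_threshold=0.8)
print(target)
```

With these made-up loads, the task assigned to `node-a` moves to `node-b`, while a task assigned to `node-c` would stay where it is. A real scheduler would also have to account for where the surviving chunks reside and for the network cost of moving work, which this sketch ignores.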