Fast Failure Recovery in Vertex-Centric Distributed Graph Processing Systems

Wei Lu, Yanyan Shen, Tongtong Wang, Meihui Zhang, H. V. Jagadish, Xiaoyong Du
DOI: https://doi.org/10.1109/TKDE.2018.2843361
IF: 9.235
2019-01-01
IEEE Transactions on Knowledge and Data Engineering
Abstract: There is a growing need for distributed graph processing systems to have many more compute nodes processing graph-based Big Data applications, which, however, increases the chance of node failures. To address this issue, we propose a novel recovery scheme that accelerates the recovery process by parallelizing the recomputation. Once a failure occurs, all recomputation is confined to the subgraphs that originally resided on the failed compute nodes. When recovery starts, these subgraphs are reassigned to another set of compute nodes, where the recomputation over them is conducted in parallel. To minimize recovery latency, we also develop a strategy for reassigning these subgraphs to the replacement compute nodes that balances computation and communication cost. We integrate the proposed recovery scheme into the Giraph system, a widely used graph processing system. Experimental results over a variety of real graph datasets demonstrate that our proposed recovery scheme outperforms existing recovery methods by up to 30x on a cluster of 40 compute nodes.
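The abstract does not spell out the reassignment strategy, so the following is only a minimal sketch of the general idea of trading off computation against communication when placing failed subgraphs. It uses a simple greedy heuristic, which is an assumption made here for illustration and not the paper's method. The function name `greedy_reassign`, the cost dictionaries, and the node labels are all hypothetical.

```python
from typing import Dict, List, Tuple


def greedy_reassign(
    comp_cost: Dict[str, float],
    comm_cost: Dict[Tuple[str, str], float],
    nodes: List[str],
) -> Dict[str, str]:
    """Greedily map failed subgraphs to compute nodes (illustrative heuristic only).

    comp_cost: hypothetical recomputation cost of each failed subgraph.
    comm_cost: hypothetical message cost between two subgraphs, paid only
               when they end up on different nodes.
    nodes:     compute nodes available to take over the failed subgraphs.
    """
    load = {n: 0.0 for n in nodes}      # computation already placed on each node
    assignment: Dict[str, str] = {}     # subgraph -> node

    def pair_cost(p: str, q: str) -> float:
        # Treat communication as symmetric: look the pair up in both orders.
        return comm_cost.get((p, q), 0.0) + comm_cost.get((q, p), 0.0)

    # Place the most expensive subgraphs first so they spread across nodes.
    for p in sorted(comp_cost, key=comp_cost.get, reverse=True):
        best_node, best_score = None, float("inf")
        for n in nodes:
            # Communication with already placed subgraphs that live elsewhere.
            remote = sum(
                pair_cost(p, q) for q, m in assignment.items() if m != n
            )
            score = load[n] + comp_cost[p] + remote
            if score < best_score:
                best_node, best_score = n, score
        assignment[p] = best_node
        load[best_node] += comp_cost[p]
    return assignment


if __name__ == "__main__":
    plan = greedy_reassign(
        comp_cost={"p1": 4.0, "p2": 3.0, "p3": 3.0},
        comm_cost={},
        nodes=["a", "b"],
    )
    print(plan)
```

With no communication cost the heuristic simply spreads the work across nodes, while a large cost between two subgraphs pulls them onto the same node. A real system would need measured costs and a stronger optimization than this greedy pass.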