Reducing Fault-tolerant Overhead for Distributed Stream Processing with Approximate Backup

Yuan Zhuang, Xiaohui Wei, Hongliang Li, Mingkai Hou, Yundi Wang
DOI: https://doi.org/10.1109/ICCCN49398.2020.9209717
2020-01-01
Abstract: The stream processing model continuously processes online data in a one-pass fashion, which can make it more vulnerable to failures than offline data processing schemes. Checkpoint-based fault-tolerant methods have been widely used to enhance the reliability of stream processing systems. To ensure exact data recovery upon failure, full-backup mechanisms store a complete copy of the data, which introduces substantial runtime overhead and increases output latency. Meanwhile, a wide range of online processing applications prefer quick-and-dirty results with a slight degradation in accuracy to delayed exact results. This paper introduces a novel approximate fault-tolerant problem (OAFP) with the objective of reducing the failure-free fault-tolerant overhead while ensuring that the user-defined output accuracy requirement is met upon failure. We present an approximate fault-tolerant scheme based on a sampling backup mechanism and study the trade-off between fault-tolerant overhead and output accuracy in stream processing systems. We propose two algorithms to compute backup plans for single-node failure and correlated failure scenarios. Extensive experiments with different types of stream topologies are conducted on our simulator to verify the correctness and effectiveness of our approach. We prove that our solution guarantees the output accuracy requirement with minimum fault-tolerant (FT) latency for directed acyclic graph (DAG) stream topologies with single-node failures.
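The trade-off between backup overhead and output accuracy can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes uniform Bernoulli sampling of tuples, a sum aggregate, and hypothetical helper names (`sample_backup`, `estimate_sum`, `relative_error`) chosen only for illustration.

```python
import random


def sample_backup(tuples, rate, seed=0):
    """Back up each tuple independently with probability `rate`.

    Assumption: uniform Bernoulli sampling; the paper's backup plans
    may choose rates per operator in a different way.
    """
    if not 0 < rate <= 1:
        raise ValueError("rate must be in (0, 1]")
    rng = random.Random(seed)
    return [t for t in tuples if rng.random() < rate]


def estimate_sum(backup, rate):
    """Estimate the sum of the original tuples from the sampled backup
    by scaling the sampled sum by 1 / rate."""
    if not 0 < rate <= 1:
        raise ValueError("rate must be in (0, 1]")
    return sum(backup) / rate


def relative_error(exact, approx):
    """Relative error of the recovered result against the exact one."""
    return abs(exact - approx) / abs(exact)


if __name__ == "__main__":
    data = list(range(1, 1001))
    exact = sum(data)
    for rate in (1.0, 0.5, 0.1):
        backup = sample_backup(data, rate)
        approx = estimate_sum(backup, rate)
        # Fewer stored tuples means less backup overhead, at the cost
        # of a larger expected error in the recovered result.
        print(rate, len(backup), relative_error(exact, approx))
```

A rate of 1.0 corresponds to a full backup with exact recovery; lower rates store fewer tuples and recover only an estimate, which is the kind of accuracy loss a user-defined requirement would bound.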