Enabling Low-Redundancy Proactive Fault Tolerance for Stream Machine Learning Via Erasure Coding

Zhinan Cheng, Lu Tang, Qun Huang, Patrick P. C. Lee
DOI: https://doi.org/10.1109/srds53918.2021.00019
2022-01-01
SSRN Electronic Journal
Abstract: Machine learning for continuous data streams, or stream machine learning in short, is increasingly adopted in real-time big data applications. Fault tolerance is a critical requirement for stream machine learning applications in large-scale distributed deployment. However, existing reactive fault tolerance mechanisms, which trigger failure recovery upon the detection of failures, inevitably incur high recovery overhead and compromise the low-latency requirement of stream machine learning. We design StreamLEC, a stream machine learning system that leverages erasure coding to provide low-redundancy proactive fault tolerance for immediate failure recovery. StreamLEC supports general stream machine learning applications, and incorporates different techniques to mitigate erasure coding overhead. Evaluation on a local cluster and Amazon EC2 shows that StreamLEC achieves much higher throughput than both reactive fault tolerance and replication-based proactive fault tolerance, with negligible failure recovery overhead.