Abstract: Spatiotemporal data analysis plays a vital role in big data processing and is a research hotspot in location-aware and recommender systems. In these applications, graph modeling and distributed iterative computing underpin data query and mining. Because iterative jobs repeatedly execute the same computation logic, they are time-consuming and place heavy pressure on system resources. They also always face the risk of being halted by a computing-node fault, which can cause serious economic losses. At present, Flink, the latest generation of distributed computing systems, recovers from node faults in batch-processing mode by restarting the job from the beginning, which is extremely time-consuming. Using the checkpoint mechanism of Flink's stream-processing mode to recover from batch job failures would instead greatly increase running time and storage overhead during failure-free execution. A lightweight fault-tolerance mechanism is therefore needed to reduce failure recovery time while preserving the efficiency of spatiotemporal data analysis jobs. Based on the characteristics of the iterative computing model for graph computing, a single-node failure recovery mechanism is proposed that recovers only the failed node and reduces recovery time by introducing lightweight checkpoints and local logs. Building on this mechanism, a recovery mechanism for multi-node faults and associated faults is proposed to cope with more complex failure situations. Experimental results show that the proposed method recovers jobs quickly and effectively after failure, reducing the average recovery time by 37% for single-node faults and by 24% for multi-node faults.
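The idea of recovering only the failed node from a lightweight checkpoint plus local logs can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the mechanism evaluated in the paper and not Flink's API: the `Worker` class, the `make_message` and `apply_update` functions, the `CHECKPOINT_EVERY` interval, and the in-memory `checkpoints` dictionary standing in for stable storage are all hypothetical. Each worker logs the messages it sends; when one worker loses its state, that worker alone is restored from its last checkpoint and replays the messages its peers logged, while the peers keep their current state.

```python
from dataclasses import dataclass, field

MOD = 1_000_003
CHECKPOINT_EVERY = 3  # hypothetical checkpoint interval, in iterations


@dataclass
class Worker:
    wid: int
    state: int
    # Local log of outgoing messages: (iteration, destination id) -> message.
    sent_log: dict = field(default_factory=dict)


def make_message(sender, dest_wid):
    # Hypothetical deterministic message derived from the sender's state.
    return (sender.state * (dest_wid + 2) + sender.wid) % MOD


def apply_update(state, incoming):
    # Hypothetical deterministic update rule for one iteration.
    return (3 * state + sum(incoming)) % MOD


def run_iteration(workers, checkpoints, it):
    # Lightweight checkpoint: snapshot only each worker's own state.
    if it % CHECKPOINT_EVERY == 0:
        for w in workers:
            checkpoints[w.wid] = (it, w.state)
    inbox = {w.wid: [] for w in workers}
    for sender in workers:
        for dest in workers:
            if sender.wid != dest.wid:
                msg = make_message(sender, dest.wid)
                sender.sent_log[(it, dest.wid)] = msg  # log locally at the sender
                inbox[dest.wid].append(msg)
    for w in workers:
        w.state = apply_update(w.state, inbox[w.wid])


def recover(workers, checkpoints, failed_wid, current_it):
    # Restore only the failed worker, then replay the messages that the
    # surviving workers logged for it; the survivors are not rolled back.
    start_it, state = checkpoints[failed_wid]
    for it in range(start_it, current_it):
        incoming = [w.sent_log[(it, failed_wid)]
                    for w in workers if w.wid != failed_wid]
        state = apply_update(state, incoming)
    workers[failed_wid].state = state


def demo():
    workers = [Worker(wid=i, state=i + 1) for i in range(3)]
    checkpoints = {}  # stands in for stable storage in this sketch
    for it in range(8):
        run_iteration(workers, checkpoints, it)
    expected = workers[1].state
    workers[1].state = None  # simulate worker 1 losing its in-memory state
    recover(workers, checkpoints, failed_wid=1, current_it=8)
    return expected, workers[1].state
```

In this sketch, `demo()` returns the state worker 1 held before the simulated fault together with the state rebuilt by `recover`; the two match because the update rule is deterministic and the replay starts from the most recent checkpoint rather than from the first iteration. A practical design would also need to truncate logs older than the latest checkpoint and handle the case where the peers holding the needed logs fail too, which is the multi-node and associated-fault setting the abstract mentions.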


A fault-tolerant optimization mechanism for spatiotemporal data analysis in flink
