Dynamic Adaptive Checkpoint Mechanism for Streaming Applications Based on Reinforcement Learning

Zhan Zhang, Tianming Liu, Yanjun Shu, Siyuan Chen, Xian Liu
DOI: https://doi.org/10.1109/icpads56603.2022.00076
2023-01-01
Abstract: For a stream processing system that uses checkpointing for fault tolerance, selecting an appropriate checkpoint interval is key to running streaming applications efficiently. State-of-the-art stream processing systems currently support only fixed-interval checkpoints, which makes it difficult to strike a good trade-off between the overhead of fault-tolerant processing and the cost of failure recovery in dynamically changing streaming scenarios. Moreover, in a complex distributed streaming environment, dynamic environmental indicators (e.g., workload and failure rate) often do not match the assumptions of existing models; for example, data about trending events on Twitter changes quickly. In this paper, we take the dynamic changes of environmental indicators into account and propose DACM, a method that dynamically adjusts the checkpoint interval using reinforcement learning. DACM adaptively optimizes the processing delay and fault recovery time while avoiding the need to model the entire streaming application environment. Experiments conducted on the Flink platform show that DACM reduces the processing delay by 10% and the failure recovery time by 37% compared with existing checkpoint interval optimization models.