FlexSP: (1 + β)-Choice Based Flexible Stream Partitioning for Stateful Operators

Siyuan Chen, Decheng Zuo, Zhan Zhang
DOI: https://doi.org/10.1145/3673038.3673157
2024-01-01
Abstract: Stream partitioning has a fundamental effect on the efficiency of data parallelism in distributed stream processing systems. The skewed and time-varying nature of streaming data makes it challenging to achieve load balancing while minimizing the cost incurred. The requirement of adaptivity further complicates the problem: the partitioning mechanism must not only capture changes in the workload and adjust itself, but also tolerate those changes, because the statistics it relies on lag behind the stream. Existing approaches use one-choice or multiple-choice schemes to trade off these factors, but they tend to treat them as opposites, and so either fail to achieve good load balancing or incur excessive cost. Deeper insight is lacking into how partitioning behavior affects load balancing, cost, and adaptivity when keys have different numbers of candidate choices. A flexible partitioning scheme is also needed to allow different trade-offs among the three factors in various scenarios. To address these issues, we propose a novel (1 + β)-choice based stream partitioning scheme, which selectively splits a fraction β ∈ (0, 1) of the keys so that they have multiple candidate choices. We demonstrate that splitting just this β fraction of the keys is sufficient to achieve optimal load balancing while minimizing cost and providing the required adaptivity to workload variance. From a new perspective, we analyze the relationship among load balancing, cost, and adaptivity, which serves as the theoretical foundation for choosing a proper β and the corresponding number of choices. Experiments on Apache Flink demonstrate that our approach outperforms state-of-the-art solutions, improving throughput by 7.3x and reducing latency by 85%.
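The (1 + β)-choice idea can be illustrated with a minimal Python sketch. It is not the FlexSP implementation: the abstract does not say how keys are selected for splitting or how a candidate is picked, so the sketch assumes that the most frequent keys are the ones split and that each tuple of a split key goes to the least-loaded candidate according to local counters. The class name `OnePlusBetaPartitioner`, its parameters, and the helper `_hash` are hypothetical names introduced only for this illustration.

```python
import hashlib
import math


def _hash(key: str, seed: int) -> int:
    """Deterministic seeded hash (hypothetical helper for this sketch)."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


class OnePlusBetaPartitioner:
    """Illustrative (1 + beta)-choice partitioner, not the paper's algorithm.

    A beta fraction of the keys is "split": each such key gets `num_choices`
    candidate workers. Every other key keeps a single hash-based choice, so
    its state lives on exactly one worker.
    """

    def __init__(self, num_workers, beta, num_choices, key_counts):
        if not 0 < beta < 1:
            raise ValueError("beta must lie in the open interval (0, 1)")
        self.num_workers = num_workers
        self.num_choices = num_choices
        self.loads = [0] * num_workers  # tuples routed to each worker so far
        # Assumption: split the most frequent keys first.
        n_split = math.ceil(beta * len(key_counts))
        ranked = sorted(key_counts, key=key_counts.get, reverse=True)
        self.split_keys = set(ranked[:n_split])

    def candidates(self, key):
        """Candidate workers for a key (entries may coincide on hash collisions)."""
        n = self.num_choices if key in self.split_keys else 1
        return [_hash(key, seed) % self.num_workers for seed in range(n)]

    def route(self, key):
        """Send one tuple to the least-loaded candidate and record the load."""
        worker = min(self.candidates(key), key=lambda w: self.loads[w])
        self.loads[worker] += 1
        return worker


# Small skewed workload: key "a" dominates, so it is the one key that is split.
key_counts = {"a": 100, "b": 5, "c": 3, "d": 1}
partitioner = OnePlusBetaPartitioner(
    num_workers=4, beta=0.25, num_choices=2, key_counts=key_counts
)
for key, count in key_counts.items():
    for _ in range(count):
        partitioner.route(key)
print(partitioner.loads)
```

In this sketch a larger β or a larger number of choices spreads hot keys over more workers, which helps balance but also means the state of more keys is replicated across workers. That is the kind of trade-off among load balancing, cost, and adaptivity that the abstract describes.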