Simois: A Scalable Distributed Stream Join System with Skewed Workloads

Fan Zhang, Hanhua Chen, Hai Jin
DOI: https://doi.org/10.1109/ICDCS.2019.00026
2019-01-01
Abstract: Many big data applications need to perform fast join operations on large-scale real-time data streams. The key challenge in designing an efficient stream join system is partitioning the streaming data among distributed processing nodes so that join computation is not concentrated on a few nodes. However, the skewed distribution of real-world streams makes such partitioning difficult. Existing hash-based partitioning schemes incur significant load imbalance, which leads to low system throughput and long processing latency, while shuffling-based strategies incur redundant join computation and much more communication. To address this issue, we propose and implement Simois, a scalable distributed stream join system that shuffles the potential top heavy-load keys while hashing the others. Identifying the keys that cause heavy workload imbalance is challenging, however, because the workload is determined by the current joint status of the two streams, and the distributions of the two streams may change over time. To solve this problem, we design a novel, efficient exponential counting scheme for identifying the keys with the heaviest workload in the two dynamic streams. The scheme has extremely low computation and space costs, so it can be readily implemented in a stream processing system. Moreover, we design a popularity decline algorithm that makes our design adaptive to highly dynamic changes in the streams. We implement Simois on top of Apache Storm and conduct comprehensive experiments using large-scale real-world traces. Experimental results show that Simois improves system throughput by 52% and reduces average latency by 37% compared to existing state-of-the-art designs.
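The hybrid idea in the abstract, shuffling the heaviest keys while hashing the rest, can be illustrated with a minimal sketch. This is not the Simois implementation. The names `heavy_keys` and `HybridPartitioner` are hypothetical, exact counters stand in for the paper's exponential counting scheme (which the abstract does not specify), and the per-key join load is assumed to be approximated by the product of the key's counts in the two streams.

```python
from collections import Counter
import zlib


def heavy_keys(stream_r, stream_s, top_k):
    """Return the top_k keys ranked by count_R(key) * count_S(key).

    Assumption: the product of the two per-stream counts approximates
    the join workload of a key. Exact counting is used here only as a
    stand-in for the paper's exponential counting scheme.
    """
    count_r = Counter(stream_r)
    count_s = Counter(stream_s)
    # Only keys present in both streams produce join results.
    load = {k: count_r[k] * count_s[k] for k in count_r.keys() & count_s.keys()}
    ranked = sorted(load, key=lambda k: (-load[k], k))
    return set(ranked[:top_k])


class HybridPartitioner:
    """Route heavy keys round-robin (shuffle) and all other keys by hash."""

    def __init__(self, num_nodes, heavy):
        self.num_nodes = num_nodes
        self.heavy = heavy
        self._next = 0  # next node for round-robin shuffling

    def route(self, key):
        if key in self.heavy:
            # Shuffle: spread tuples of a heavy key evenly across nodes.
            node = self._next
            self._next = (self._next + 1) % self.num_nodes
            return node
        # Hash: every tuple of a light key goes to one fixed node.
        return zlib.crc32(str(key).encode("utf-8")) % self.num_nodes


# Hypothetical toy streams: key "a" dominates both of them.
stream_r = ["a"] * 6 + ["b"] * 2 + ["c"]
stream_s = ["a"] * 5 + ["b"] * 3 + ["d"] * 4

heavy = heavy_keys(stream_r, stream_s, top_k=1)
partitioner = HybridPartitioner(num_nodes=4, heavy=heavy)
placement = [(key, partitioner.route(key)) for key in stream_r]
```

The sketch covers only where tuples are stored. In a complete join, a probing tuple for a shuffled key would presumably have to reach every node that holds tuples of that key, which is the extra communication the abstract attributes to shuffling and the reason to limit it to the few heaviest keys.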