Abstract: Stream processing systems are widely used to process large volumes of application-generated data in real time because of their low latency and high throughput. Most streaming applications require a combined analysis of data from multiple sources, which makes stream joins a fundamental operation in stream processing systems. Like other big data problems, stream joins suffer from load imbalance: a few nodes that handle most of the load can become bottlenecks, increasing latency and reducing throughput. Achieving good load balance at low overhead is therefore a critical issue in the design of stream join systems. To address it, we propose an adaptive non-migrating load-balancing method aimed primarily at the stream window join problem. When state is split across multiple downstream instances, the completeness of the join results can be guaranteed by replicating each input tuple and sending the replicas to those instances. Our method uses routing tables to control this replication and forwarding of input tuples. When the system becomes unbalanced, it changes the load distribution by directly changing how later-arriving tuples are partitioned rather than by migrating state, and thus achieves load balancing with very low overhead. Based on this method, we develop NM-Join, a distributed stream window join system built on Flink. We theoretically analyze the completeness and effectiveness of our method and experimentally evaluate NM-Join in terms of load-balancing effect, latency, and throughput. The results show that our method balances load with very low additional overhead and therefore outperforms existing load-balancing methods in latency and throughput.
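
The routing-table idea described in the abstract can be illustrated with a minimal Python sketch. It is not NM-Join's implementation: the class names (`JoinInstance`, `Router`), the per-key `store_target` and `probe_targets` tables, and the CRC32 default partitioner are all assumptions made for illustration, and window expiry is omitted. The sketch only shows the principle that a hot key can be redirected to another instance for future tuples while its existing state stays in place, with replication of each arriving tuple to every instance that still holds state for the key keeping the join result complete.

```python
import zlib
from collections import defaultdict


class JoinInstance:
    """Hypothetical downstream join instance holding per-stream, per-key state."""

    def __init__(self):
        self.state = {"R": defaultdict(list), "S": defaultdict(list)}

    def probe(self, stream, key, value):
        # Join the incoming tuple against stored tuples of the opposite stream.
        other = "S" if stream == "R" else "R"
        stored = self.state[other].get(key, ())
        if stream == "R":
            return [(key, value, s) for s in stored]
        return [(key, r, value) for r in stored]

    def store(self, stream, key, value):
        self.state[stream][key].append(value)

    def load(self):
        # Number of tuples currently stored on this instance.
        return sum(len(v) for per_key in self.state.values() for v in per_key.values())


class Router:
    """Hypothetical upstream router driven by two per-key routing tables."""

    def __init__(self, num_instances):
        self.instances = [JoinInstance() for _ in range(num_instances)]
        self.store_target = {}                  # key -> instance storing new tuples
        self.probe_targets = defaultdict(set)   # key -> instances holding state

    def _default(self, key):
        # Deterministic hash partitioning (assumed default placement).
        return zlib.crc32(str(key).encode()) % len(self.instances)

    def route(self, stream, key, value):
        home = self.store_target.setdefault(key, self._default(key))
        self.probe_targets[key].add(home)
        results = []
        # Replicate the tuple to every instance that holds state for this key.
        for i in self.probe_targets[key]:
            results.extend(self.instances[i].probe(stream, key, value))
        # Store it on exactly one instance, so each pair is produced once.
        self.instances[home].store(stream, key, value)
        return results

    def rebalance(self, key, new_instance):
        # Non-migrating: only later tuples of `key` go to the new instance;
        # state already stored elsewhere is left where it is.
        self.store_target[key] = new_instance
        self.probe_targets[key].add(new_instance)


router = Router(3)
out = router.route("R", "k", 1) + router.route("S", "k", "a")
old = router.store_target["k"]
router.rebalance("k", (old + 1) % 3)
out += router.route("S", "k", "b") + router.route("R", "k", 2)
print(sorted(out))
```

In this sketch the later tuple of every matching pair probes all instances that hold state for its key, while the earlier tuple is stored on exactly one of them, so each pair is emitted exactly once before and after the routing change. In a windowed setting, an instance could presumably be dropped from `probe_targets` once its state for the key has expired, but that step is not modeled here.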

Distributed Streaming Set Similarity Join

Continuous similarity join on data streams

Simois: A Scalable Distributed Stream Join System with Skewed Workloads

Fast and Scalable Distributed Set Similarity Joins for Big Data Analytics

TriJoin: A Time-Efficient and Scalable Three-Way Distributed Stream Join System

FastJoin: A Skewness-Aware Distributed Stream Join System

Distributed Stream Join under Workload Variance

SepJoin: A Distributed Stream Join System with Low Latency and High Throughput

Progressive online aggregation in a distributed stream system

Online Join Method for Skewed Data Streams

EMD-DSJoin: Efficient Similarity Join Over Probabilistic Data Streams Based on Earth Mover’s Distance

Multi-Way Windowed Streams Θ-Joins Using Cluster

Flexible and Adaptive Stream Join Algorithm

Query Optimization over Distributed Data Stream

Efficient String Similarity Join in Multi-Core and Distributed Systems

Efficient Join Processing Over Incomplete Data Streams (Technical Report)

High-Performance Data Distribution Algorithm on Distributed Stream Systems

PMJoin: Optimizing Distributed Multi-way Stream Joins by Stream Partitioning

Efficient and Scalable Processing of String Similarity Join

Sharing Aggregate Computation of Multiple Group by Queries over Distributed Data Stream

An Adaptive Non-Migrating Load-Balanced Distributed Stream Window Join System