Abstract: Stream processing systems are widely used to process large volumes of application-generated data in real time because of their low latency and high throughput. Most streaming applications must analyze data from multiple sources together, so stream joins are a fundamental operation in stream processing systems. Like other big data workloads, stream joins suffer from load imbalance: a few nodes that handle most of the load become bottlenecks, which increases latency and reduces throughput. Achieving good load balance at low overhead is therefore a critical issue in the design of stream join systems. To address it, we propose an adaptive non-migrating load-balancing method aimed primarily at stream window joins. When the state of a join is split across multiple downstream instances, the completeness of the join results can be guaranteed by replicating each input tuple and sending the replicas to those instances. Our method controls this replication and forwarding through routing tables. When the system becomes unbalanced, it changes the load distribution by directly changing how later-arriving tuples are partitioned rather than by migrating state, and thus achieves load balancing with very low overhead. Based on this method, we develop NM-Join, a distributed stream window join system built on Flink. We theoretically analyze the completeness and effectiveness of our method and evaluate NM-Join extensively in terms of load-balancing effect, latency, and throughput. Experimental results show that our method balances load with very low additional overhead and therefore outperforms existing load-balancing methods in latency and throughput.
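The routing-table idea described in the abstract can be illustrated with a minimal Python sketch. It is not the NM-Join implementation: the class names (`JoinInstance`, `Router`), the `rebalance` method, and the policy of storing each tuple on one instance while probing every instance that may hold matching state are all assumptions made for illustration. Window expiry and the decision of when to rebalance are omitted.

```python
import zlib
from collections import defaultdict


class JoinInstance:
    """One downstream join instance holding part of the join state (hypothetical)."""

    def __init__(self):
        # Stored tuples per stream ("R" or "S"), grouped by join key.
        self.state = {"R": defaultdict(list), "S": defaultdict(list)}
        self.stored = 0  # simple load metric: number of stored tuples

    def probe(self, stream, key, value):
        """Join an incoming tuple against stored tuples of the opposite stream."""
        if stream == "R":
            return [(key, value, s) for s in self.state["S"].get(key, [])]
        return [(key, r, value) for r in self.state["R"].get(key, [])]

    def store(self, stream, key, value):
        self.state[stream][key].append(value)
        self.stored += 1


class Router:
    """Upstream router that replicates and forwards tuples via a routing table."""

    def __init__(self, num_instances):
        self.instances = [JoinInstance() for _ in range(num_instances)]
        # Routing table: key -> (instance that stores new tuples,
        #                        instances that must be probed).
        self.routes = {}

    def home(self, key):
        # Deterministic default partition for a key.
        return zlib.crc32(repr(key).encode()) % len(self.instances)

    def _route(self, key):
        if key not in self.routes:
            h = self.home(key)
            self.routes[key] = (h, [h])
        return self.routes[key]

    def rebalance(self, key, new_target):
        """Redirect future tuples of `key` to `new_target` without moving state."""
        _, probe_targets = self._route(key)
        if new_target not in probe_targets:
            probe_targets = probe_targets + [new_target]
        self.routes[key] = (new_target, probe_targets)

    def process(self, stream, key, value):
        store_target, probe_targets = self._route(key)
        results = []
        # Replicate the tuple to every instance that may hold matching state.
        for i in probe_targets:
            results.extend(self.instances[i].probe(stream, key, value))
        # Store the tuple on exactly one instance, so each pair is emitted once.
        self.instances[store_target].store(stream, key, value)
        return results
```

In this sketch, a call such as `router.rebalance(key, other_instance)` only edits the routing table. Old state stays where it is and is still probed, while tuples arriving later are stored on the new instance, so load shifts without any state transfer. In a window join, an instance could presumably be dropped from the probe set once its old state for the key has expired; that step is left out here.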

FastJoin: A Skewness-Aware Distributed Stream Join System

Join Query Optimization Based on MapReduce under Skewed Data

High-Performance Data Distribution Algorithm on Distributed Stream Systems

Online Join Method for Skewed Data Streams

Flexible and Adaptive Stream Join Algorithm

Simois: A Scalable Distributed Stream Join System with Skewed Workloads

Distributed Stream Join under Workload Variance

SepJoin: A Distributed Stream Join System with Low Latency and High Throughput

Cost-Effective Data Partition For Distributed Stream Processing System

An Adaptive Non-Migrating Load-Balanced Distributed Stream Window Join System

Cost-Effective Stream Join Algorithm on Cloud System

An Adaptive Skew Handling Join Algorithm for Large-scale Data Analysis

Research on Data Skew Join Algorithm Based on MapReduce Model

Distributed Streaming Set Similarity Join

AdaptMX: Flexible Join-Matrix Streaming System for Distributed Theta-Joins

TriJoin: A Time-Efficient and Scalable Three-Way Distributed Stream Join System

Fast and Scalable Distributed Set Similarity Joins for Big Data Analytics

Low-Latency Adaptive Distributed Stream Join System Based on a Flexible Join Model

PStream: A Popularity-Aware Differentiated Distributed Stream Processing System

BS-Join: A Novel and Efficient Mixed Batch-Stream Join Method for Spatiotemporal Data Management in Flink

Parallel Stream Processing Against Workload Skewness and Variance