Abstract: Distributed Stream Processing Systems (DSPSs) are among the currently most emerging topics in data management, with applications ranging from real-time event monitoring to processing complex dataflow programs and big data analytics. The major market players in this domain are clearly represented by Apache Spark and Flink, which provide a variety of frontend APIs for SQL, statistical inference, machine learning, stream processing, and many others. Yet rather few details are reported on the integration of these engines into the underlying High-Performance Computing (HPC) infrastructure and the communication protocols they use. Spark and Flink, for example, are implemented in Java and still rely on a dedicated master node for managing their control flow among the worker nodes in a compute cluster. In this paper, we describe the architecture of our AIR engine, which is designed from scratch in C++ using the Message Passing Interface (MPI), pthreads for multithreading, and is directly deployed on top of a common HPC workload manager such as SLURM. AIR implements a light-weight, dynamic sharding protocol (referred to as "Asynchronous Iterative Routing"), which facilitates a direct and asynchronous communication among all client nodes and thereby completely avoids the overhead induced by the control flow with a master node that may otherwise form a performance bottleneck. Our experiments over a variety of benchmark settings confirm that AIR outperforms Spark and Flink in terms of latency and throughput by a factor of up to 15; moreover, we demonstrate that AIR scales out much better than existing DSPSs to clusters consisting of up to 8 nodes and 224 cores.

What problem does this paper attempt to address?

The paper addresses the performance bottlenecks and scalability limits of current Distributed Stream Processing Systems (DSPSs) when they process large-scale data streams. Specifically:

1. **Performance Bottlenecks**: Existing systems such as Apache Spark and Flink rely on a centralized master node to manage the control flow among the worker nodes. This adds communication overhead and can turn the master into a performance bottleneck, which limits overall system performance, especially in large clusters.
2. **Scalability Issues**: When Spark and Flink are scaled out to multiple nodes, the centralized architecture keeps their performance gains below what the added hardware would suggest.

To overcome these problems, the paper proposes a new distributed stream processing engine, AIR (named after its "Asynchronous Iterative Routing" protocol), built on the following design points:

- **Master-less Architecture**: AIR uses a dynamic sharding protocol (Asynchronous Iterative Routing) that lets all client nodes communicate directly and asynchronously, which avoids the overhead of a centralized master node entirely.
- **Global Asynchronous Transformation Operators**: Stateless operators (such as Map, Split, Filter) run fully asynchronously across the communication channels and across the nodes of the cluster.
- **Local Asynchronous Sliding-Window Operators**: Stateful operators (such as Reduce, Join, Aggregate) combine asynchronous local pre-processing with synchronous global processing, so they remain largely asynchronous while still producing correct results.
- **Multi-threaded Channel Processing**: Communication channels are heavily multi-threaded on top of the MPI API, which improves the core utilization of each dataflow operator.
- **Pipelining**: For stateless operators, message passing can be pipelined directly, which reduces MPI communication.

With these design choices, AIR achieves higher throughput and lower latency than Spark and Flink across a variety of benchmarks. The improvement is most pronounced when scaling out to a cluster of 8 nodes and 224 cores.
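The master-less sharding and the local-then-global handling of stateful operators described above can be illustrated with a small, standard-library-only C++ sketch. This is not AIR's actual code or protocol. It assumes a simple hash-based key-to-rank mapping, uses tumbling windows instead of sliding windows to stay short, and replaces MPI message exchange with plain in-memory containers. All names (`Event`, `shard_for_key`, `local_preaggregate`, `global_merge`) are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical event record; the field names are assumptions for this sketch.
struct Event {
    std::string key;
    std::int64_t timestamp;  // non-negative event time in arbitrary ticks
    std::int64_t value;
};

// FNV-1a 64-bit hash. Used instead of std::hash because every process must
// compute the same value for the same key.
std::uint64_t fnv1a(const std::string& s) {
    std::uint64_t h = 14695981039346656037ULL;
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

// Master-less sharding (assumed scheme): each worker derives the owning rank
// of a key locally, so no central node has to be asked where data belongs.
int shard_for_key(const std::string& key, int world_size) {
    return static_cast<int>(fnv1a(key) % static_cast<std::uint64_t>(world_size));
}

using WindowKey = std::pair<std::int64_t, std::string>;  // (window id, key)
using PartialSums = std::map<WindowKey, std::int64_t>;

// Local, asynchronous step: a worker reduces its own events per
// (tumbling window, key) without coordinating with any other worker.
PartialSums local_preaggregate(const std::vector<Event>& events,
                               std::int64_t window_size) {
    PartialSums partial;
    for (const Event& e : events) {
        partial[WindowKey(e.timestamp / window_size, e.key)] += e.value;
    }
    return partial;
}

// Global, synchronous step: the rank that owns a key merges the partial
// sums that all workers produced for that key.
PartialSums global_merge(const std::vector<PartialSums>& partials,
                         int my_rank, int world_size) {
    PartialSums merged;
    for (const PartialSums& p : partials) {
        for (const auto& entry : p) {
            if (shard_for_key(entry.first.second, world_size) == my_rank) {
                merged[entry.first] += entry.second;
            }
        }
    }
    return merged;
}
```

In a real deployment the partial sums would travel between ranks as messages (for example over MPI) rather than being passed around as a vector. The point of the sketch is only that routing decisions and pre-aggregation need no central coordinator, while the final merge per key happens at a single owner.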

AIR: A Light-Weight Yet High-Performance Dataflow Engine based on Asynchronous Iterative Routing

TATA: Throughput-Aware TAsk Placement in Heterogeneous Stream Processing with Deep Reinforcement Learning

Asynchronous Complex Analytics in a Distributed Dataflow Architecture

AutoFlow: Hotspot-Aware, Dynamic Load Balancing for Distributed Stream Processing

Streaming Data in HPC Workflows Using ADIOS

Data Stream Processing for Packet-Level Analytics

AgileDART: An Agile and Scalable Edge Stream Processing Engine

Design and implementation of reconfigurable acceleration for in-memory distributed big data computing.

Benchmarking Distributed Stream Data Processing Systems

An efficient architecture for processing real-time traffic data streams using apache flink

AirDnD -- Asynchronous In-Range Dynamic and Distributed Network Orchestration Framework

SunwayMR: A Distributed Parallel Computing Framework with Convenient Data-Intensive Applications Programming.

Canalis: A Throughput-Optimized Framework for Real-Time Stream Processing of Wireless Communication

SpeedStream: A real-time stream data processing platform in the cloud.

Rapier: Integrating routing and scheduling for coflow-aware data center networks

Pilot-Streaming: A Stream Processing Framework for High-Performance Computing

An elastic and traffic-aware scheduler for distributed data stream processing in heterogeneous clusters

Distributed File Streamer: A Framework for Distributed Application Data Coupling

Processing Particle Data Flows with SmartNICs

A Case Study of Accelerating Apache Spark with FPGA.

Using Paralleled-PEs Method to Resolve the Bursting Data in Distributed Stream Processing System