Abstract: This paper presents LMStream, which ensures bounded latency while maximizing throughput on GPU-enabled micro-batch streaming systems. LMStream's design rests on two novel mechanisms: (1) dynamic batching and (2) dynamic operation-level query planning. By controlling the micro-batch size, LMStream significantly reduces the latency of each individual dataset, because it does not buffer data unconditionally merely to improve GPU utilization. LMStream bounds the latency to an optimal value according to the characteristics of the window operation used in the streaming application. Dynamically mapping a query to an execution device, based on the data size and a dynamic device preference, improves both throughput and latency. In addition, LMStream proposes a low-overhead method for optimizing cost model parameters online, without interrupting real-time stream processing. We implemented LMStream on Apache Spark, which supports micro-batch stream processing. Compared to the previous throughput-oriented method, LMStream improved average latency by up to 70.7% and average throughput by up to 1.74x.

What problem does this paper attempt to address?

The main problem this paper addresses is how to maximize throughput while keeping latency bounded in a GPU-accelerated distributed micro-batch stream processing system. Existing micro-batch stream processing models improve GPU utilization by buffering data unconditionally, which lets the latency of an individual dataset grow without bound and degrades real-time stream processing applications. The paper proposes LMStream, which bounds latency and improves overall system throughput by dynamically adjusting the micro-batch size and the execution device (CPU or GPU).

### Core Problems of the Paper

1. **Trade-off between Latency and Throughput**: In a stream processing system, latency and throughput usually pull in opposite directions. To increase throughput, existing methods buffer data unconditionally, which significantly increases latency. The goal of the paper is to increase system throughput as much as possible without sacrificing latency.
2. **Dynamic Micro-batch Control**: The traditional micro-batch model uses a fixed trigger time to decide when to process data, which leads to uncontrolled growth in latency. The paper proposes a dynamic micro-batch control mechanism that adjusts the micro-batch size according to the characteristics of the window operation, so that latency stays within a bound.
3. **Effective Query Plan**: To further optimize latency and throughput, the paper proposes an operation-level query planning mechanism that dynamically selects an execution device (CPU or GPU) according to the size of the data. This reduces the total processing time and improves system performance.
4. **Online Parameter Optimization**: When a stream processing application starts running, the system has no prior information about the characteristics of the workload. The paper proposes a low-overhead method for optimizing cost model parameters online, which adapts system parameters to different workload types without affecting real-time stream processing.

### Main Contributions

- **Dynamic Micro-batch Mechanism**: LMStream proposes a dynamic micro-batch mechanism that bounds latency by dynamically adjusting the micro-batch size.
- **Effective Operation-level Query Plan**: LMStream reduces processing time, and thereby improves both throughput and latency, by dynamically selecting an execution device (CPU or GPU).
- **Online Parameter Optimization**: LMStream implements a low-overhead online parameter optimization method that adjusts system parameters without interrupting real-time stream processing.
- **Implementation in a Real System**: The paper implements LMStream on Apache Spark and verifies its effectiveness and performance improvement on a variety of real-world stream processing benchmarks.

### Specific Formulas

1. **Objective Function**:
   - Maximize the average throughput:
     \[ \max_{i} \text{AvgThPut}_i \]
   - Constraints:
     \[ \text{MaxLat}_i < \text{SlideTime} \quad (\text{when } \text{SlideTime} > 0) \]
     \[ \text{MaxLat}_i \leq \sum_{k = 0}^{i - 1} \text{MaxLat}_k \quad (\text{when } \text{SlideTime} = 0) \]
2. **Definitions of Throughput and Latency**:
   - Average throughput:
     \[ \text{AvgThPut}_i = \frac{\sum_{k = 0}^{i} \sum_{j = 0}^{\text{NumCores}} \text{Part}(k, j)}{\sum_{k = 0}^{i} \text{Proc}_k} \]
   - Maximum latency:
     \[ \text{MaxLat}_i = \max_{j \in \text{NumDS}_i} (\text{Buff}(i, j)) \]
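To make the formulas above concrete, the following is a minimal Python sketch that evaluates the average throughput and checks the latency constraint for a sequence of micro-batches. It is not LMStream's implementation: the `MicroBatch` record, its field names, and the helper functions are hypothetical and introduced only for illustration. The maximum-latency formula appears to be cut off in the summary, so the sketch uses only the buffering term exactly as written there.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MicroBatch:
    """Hypothetical per-micro-batch record; field names are illustrative."""
    partition_sizes: List[int]  # Part(k, j): amount of data in partition j
    proc_time: float            # Proc_k: processing time of micro-batch k (s)
    buffer_times: List[float]   # Buff(k, j): buffering time of dataset j (s)


def avg_throughput(batches: List[MicroBatch]) -> float:
    """AvgThPut_i: all data processed so far divided by total processing time."""
    total_data = sum(sum(b.partition_sizes) for b in batches)
    total_proc = sum(b.proc_time for b in batches)
    if total_proc <= 0:
        raise ValueError("total processing time must be positive")
    return total_data / total_proc


def max_latency(batch: MicroBatch) -> float:
    """MaxLat_i as written in the summary: the largest buffering time.

    Assumption: only the buffering term is used, because the formula
    in the summary appears truncated.
    """
    return max(batch.buffer_times)


def latency_bound_ok(batches: List[MicroBatch], slide_time: float) -> bool:
    """Check the latency constraint for the most recent micro-batch."""
    current = max_latency(batches[-1])
    if slide_time > 0:
        # Sliding window: latency must stay below the slide interval.
        return current < slide_time
    # No sliding window: compare against the sum over earlier micro-batches.
    return current <= sum(max_latency(b) for b in batches[:-1])


batches = [
    MicroBatch(partition_sizes=[100, 200], proc_time=1.0, buffer_times=[0.5, 0.2]),
    MicroBatch(partition_sizes=[300, 400], proc_time=1.0, buffer_times=[0.4, 0.3]),
]
print(avg_throughput(batches))                     # 1000 units over 2.0 s
print(latency_bound_ok(batches, slide_time=1.0))   # sliding-window case
print(latency_bound_ok(batches, slide_time=0.0))   # no-slide case
```

Note that with `slide_time` equal to zero and only one micro-batch, the bound is an empty sum, so the check passes only when nothing was buffered; a real system would need some initial bound for the first micro-batch.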

LMStream: When Distributed Micro-Batch Stream Processing Systems Meet GPU
