Abstract: Stream processing has become a critical component in the architecture of modern applications. With the exponential growth of data generation from sources such as the Internet of Things, business intelligence, and telecommunications, real-time processing of unbounded data streams has become a necessity. Distributed Stream Processing (DSP) systems provide a solution to this challenge, offering high horizontal scalability, fault-tolerant execution, and the ability to process data streams from multiple sources in a single DSP job. Often enough though, data streams need to be enriched with extra information for correct processing, which introduces additional dependencies and potential bottlenecks. In this paper, we present an in-depth evaluation of data enrichment methods for DSP systems and identify the different use cases for stream processing in modern systems. Using a representative DSP system and conducting the evaluation in a realistic cloud environment, we found that outsourcing enrichment data to the DSP system can improve performance for specific use cases. However, this increased resource consumption highlights the need for stream processing solutions specifically designed for the performance-intensive workloads of cloud-based applications.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper explores data enrichment methods in Distributed Stream Processing (DSP) systems and evaluates how well these methods suit different application scenarios. Specifically:

1. **Research Background and Motivation**:
   - With the rapid increase in data generation in fields such as the Internet of Things, business intelligence, and telecommunications, real-time processing of unbounded data streams has become a key component of modern application architectures.
   - DSP systems offer high horizontal scalability, fault-tolerant execution, and the ability to process data from multiple sources. However, data streams often need to be enriched with additional information during processing, which can introduce extra dependencies and potential bottlenecks.

2. **Research Objectives**:
   - To determine suitable data enrichment methods for specific application scenarios through an in-depth evaluation of the different methods.
   - To focus particularly on low-latency applications, using Apache Flink as the representative DSP system for the experiments.

3. **Contributions**:
   - Defined general categories of application scenarios and analyzed the assumptions and common application scenarios of data enrichment methods.
   - Conducted a detailed empirical evaluation of various data enrichment methods, providing an understanding of their behavior in different application scenarios.
   - Provided a publicly available repository containing all relevant experimental artifacts and documentation.

4. **Specific Problem Analysis**:
   - The need for data enrichment may arise from a lack of necessary contextual information in the data, the need to detect hidden patterns, or the integration of data from different sources.
   - Defined three categories of application scenarios based on data availability, data volume, and time sensitivity: simple queries, complex queries, and limited data sources.

5. **Experimental Results**:
   - Compared the performance of synchronous and asynchronous data enrichment methods, with results showing that asynchronous methods have lower and more stable latency.
   - Although asynchronous methods outperform synchronous methods, they still face bottlenecks under high throughput, which calls for further research into caching strategies.

Through the above analysis, this paper aims to give practitioners guidance on selecting appropriate data enrichment methods for specific application scenarios.
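The difference between synchronous and asynchronous enrichment can be illustrated with a minimal Python sketch. It is not the paper's Flink implementation: the reference data, the 10 ms lookup delay, the `capacity` limit, and all function names are assumptions made for illustration. The synchronous variant blocks on every lookup, while the asynchronous variant keeps several lookups in flight at once and still returns results in input order.

```python
import asyncio
import time

# Hypothetical reference data standing in for an external enrichment store.
REFERENCE_DATA = {"sensor-1": "Berlin", "sensor-2": "Potsdam"}
LOOKUP_DELAY_S = 0.01  # assumed round-trip time of a single lookup


def lookup_sync(key):
    """Blocking lookup: the caller waits for the full round trip."""
    time.sleep(LOOKUP_DELAY_S)
    return REFERENCE_DATA.get(key, "unknown")


async def lookup_async(key):
    """Non-blocking lookup: other lookups can proceed while this one waits."""
    await asyncio.sleep(LOOKUP_DELAY_S)
    return REFERENCE_DATA.get(key, "unknown")


def enrich_sync(events):
    """Enrich events one at a time; total time grows with the event count."""
    return [{**e, "location": lookup_sync(e["sensor"])} for e in events]


async def enrich_async(events, capacity=10):
    """Enrich events with at most `capacity` lookups in flight."""
    semaphore = asyncio.Semaphore(capacity)

    async def enrich_one(event):
        async with semaphore:
            location = await lookup_async(event["sensor"])
        return {**event, "location": location}

    # gather() returns results in the order of the input events.
    return await asyncio.gather(*(enrich_one(e) for e in events))
```

Calling `enrich_sync(events)` and `asyncio.run(enrich_async(events))` on the same input yields the same enriched records. Under the assumed delay, the asynchronous version finishes sooner because the lookups overlap, but the `capacity` limit still caps throughput, which mirrors the bottleneck under high load noted above.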

Evaluation of Data Enrichment Methods for Distributed Stream Processing Systems

TATA: Throughput-Aware TAsk Placement in Heterogeneous Stream Processing with Deep Reinforcement Learning

Benchmarking Distributed Stream Data Processing Systems

Demeter: Resource-Efficient Distributed Stream Processing under Dynamic Loads with Multi-Configuration Optimization

Using Paralleled-PEs Method to Resolve the Bursting Data in Distributed Stream Processing System

Progressive online aggregation in a distributed stream system

Data Stream Processing for Packet-Level Analytics

A Scalable and Robust Framework for Data Stream Ingestion

Model-driven development of data intensive applications over cloud resources

A demonstration of the MaxStream federated stream processing system

Hardware-Conscious Stream Processing

Design and Implementation of the MaxStream Federated Stream Processing Architecture

Study and Implementation of Elastic Stream Computing in the Cloud

Daedalus: Self-Adaptive Horizontal Autoscaling for Resource Efficiency of Distributed Stream Processing Systems

Analyzing efficient stream processing on modern hardware

Streaming vs. Functions: A Cost Perspective on Cloud Event Processing

Scalable and Reliable Multi-Dimensional Aggregation of Sensor Data Streams

Service Intelligence Oriented Distributed Data Stream Integration

SpeedStream: A real-time stream data processing platform in the cloud.

An Exploratory Study of How Specialists Deal with Testing in Data Stream Processing Applications

Using Dedicated and Opportunistic Networks in Synergy for a Cost-effective Distributed Stream Processing Platform