Flo: a Semantic Foundation for Progressive Stream Processing

Shadaj Laddad, Alvin Cheung, Joseph M. Hellerstein, Mae Milano
2024-11-13
Abstract: Streaming systems are present throughout modern applications, processing continuous data in real-time. Existing streaming languages have a variety of semantic models and guarantees that are often incompatible. Yet all these languages are considered "streaming" -- what do they have in common? In this paper, we identify two general yet precise semantic properties: streaming progress and eager execution. Together, they ensure that streaming outputs are deterministic and kept fresh with respect to streaming inputs. We formally define these properties in the context of Flo, a parameterized streaming language that abstracts over dataflow operators and the underlying structure of streams. It leverages a lightweight type system to distinguish bounded streams, which allow operators to block on termination, from unbounded ones. Furthermore, Flo provides constructs for dataflow composition and nested graphs with cycles. To demonstrate the generality of our properties, we show how key ideas from representative streaming and incremental computation systems -- Flink, LVars, and DBSP -- have semantics that can be modeled in Flo and guarantees that map to our properties.
Programming Languages; Distributed, Parallel, and Cluster Computing
### What problems does this paper attempt to solve?

This paper addresses the diversity and incompatibility of the semantic models and guarantees offered by stream processing systems. Although existing streaming languages are all considered "streaming", they differ in how they define streams, in their state-persistence semantics, and in how they treat concepts such as windowed aggregation and batch processing. As a result, the semantics of stream processing systems are ambiguous and difficult to unify.

To address this, the authors propose two key semantic properties: **streaming progress** and **eager execution**. Together, these properties ensure that streaming outputs are deterministic and kept up to date with respect to the inputs. With these properties in hand, the paper provides a formal framework for understanding and comparing different streaming languages.

### Main contributions

1. **Formally define streaming progress and eager execution**: In the context of Flo, a parameterized streaming language, the authors formally define these two properties and present a type system for reasoning about stream termination.
2. **Introduce constructs for composing operators**: The authors introduce constructs for composing operators into dataflow graphs in Flo and prove that these constructs preserve the key properties.
3. **Describe the semantics of nested streams and graphs**: The authors describe the semantics of nested streams and graphs in Flo and show how they integrate with streaming progress and eager execution.
4. **Capture the essence of existing streaming languages**: The authors show how Flo can model key ideas from representative streaming and incremental computation systems (Flink, LVars, and DBSP), and how the existing semantic goals of those systems map to streaming progress and eager execution.

### Motivating examples

To show why a streaming model with strong semantic guarantees is needed, the authors walk through a simple program that processes a stream of numbers and sums them, illustrating the problems developers can run into. For example, applying the `fold` operator to a stream that never terminates causes the program to hang, consuming resources without ever producing output. The authors explore several solutions:

- Checking boundedness constraints
- Forcing conversion to a bounded stream
- Using streaming operators (such as `scan`)

Each of these solutions serves the two core properties, streaming progress and eager execution. The authors use them to show how both properties can be satisfied in different scenarios so that streaming programs execute safely and efficiently.

### Summary

The main goal of this paper is to provide a unified semantic foundation for stream processing systems through the two properties of streaming progress and eager execution. This helps explain the differences between existing streaming languages and provides theoretical support for designing more reliable and efficient stream processing systems.
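
The three options from the motivating example can be illustrated with a minimal Python sketch. It is not Flo syntax and not the authors' implementation: the `Stream` wrapper, its `bounded` flag, and the method names `take`, `fold_sum`, and `scan_sum` are hypothetical, and the runtime check merely stands in for what the paper describes as a static type-system check.

```python
from itertools import accumulate, count, islice


class Stream:
    """Hypothetical wrapper: an iterator plus a boundedness tag."""

    def __init__(self, items, bounded):
        self.items = iter(items)
        self.bounded = bounded

    def take(self, n):
        # Option 2: force a bounded prefix of a possibly unbounded stream.
        return Stream(islice(self.items, n), bounded=True)

    def fold_sum(self):
        # Option 1: reject a blocking aggregate on unbounded input.
        # (A runtime stand-in for a static boundedness check.)
        if not self.bounded:
            raise TypeError("fold would block forever on an unbounded stream")
        return sum(self.items)

    def scan_sum(self):
        # Option 3: emit a running total per input, so output keeps
        # flowing even if the input never terminates.
        return Stream(accumulate(self.items), self.bounded)


def naturals():
    # Unbounded input: 1, 2, 3, ...
    return Stream(count(1), bounded=False)


# Folding the unbounded stream directly is rejected instead of hanging.
try:
    naturals().fold_sum()
    rejected = False
except TypeError:
    rejected = True

# Bounding the stream first makes the fold safe.
total = naturals().take(4).fold_sum()

# scan makes progress on the unbounded stream; we only observe 4 outputs.
running = list(naturals().scan_sum().take(4).items)

print(rejected, total, running)  # True 10 [1, 3, 6, 10]
```

In this sketch the `fold`-style aggregate produces a single value only once its input ends, while the `scan`-style aggregate yields an output for every input, which is why only the latter can run on an unbounded stream.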