Abstract: In this paper, we present an advanced analysis of near-optimal algorithms that use limited space to solve the frequency estimation, heavy hitters, frequent items, and top-k approximation problems in the bounded deletion model. We define the family of SpaceSaving$\pm$ algorithms and explain why the original SpaceSaving$\pm$ algorithm only works when insertions and deletions are not interleaved. Next, we propose the new Double SpaceSaving$\pm$, Unbiased Double SpaceSaving$\pm$, and Integrated SpaceSaving$\pm$ algorithms and prove their correctness. The three proposed algorithms represent different trade-offs: Double SpaceSaving$\pm$ can be extended to provide unbiased estimates, while Integrated SpaceSaving$\pm$ uses less space. Since data streams are often skewed, we present an improved analysis of these algorithms and show that errors do not depend on the hot items. We also demonstrate how to achieve relative error guarantees under mild assumptions. Moreover, we establish that all three algorithms satisfy the important mergeability property, which is essential for running them in distributed settings.

What problem does this paper attempt to address?

The paper addresses how to efficiently solve the frequency estimation, heavy hitters, frequent items, and top-k approximation problems over data streams in the bounded-deletion model. Specifically:

1. **Frequency Estimation Problem**: Given a data stream $\sigma$ over a universe $U$, construct a summary that provides an estimate $\hat{f}(x)$ of the true frequency $f(x)$ for each item $x\in U$.
2. **Heavy Hitters Problem**: Given a frequency threshold $T$, identify all items with a frequency greater than $T$. This is useful when data scientists and system operators need to know which items occur very frequently.

### Research Background and Challenges

- **Insert-only Model**: Every stream update is an insertion, and the frequency $f(x)$ is the number of times $x$ has been inserted.
- **Turnstile Model**: Stream updates can be insertions or deletions. The net frequency $f(x)$ is the number of times $x$ has been inserted minus the number of times it has been deleted, and no item is allowed to have a negative frequency.
- **Bounded-deletion Model**: The number of deletions $D$ is limited to a fraction of the number of insertions $I$, that is, $D\leq(1 - 1/\alpha)I$ for a parameter $\alpha\geq 1$.

The existing SpaceSaving$\pm$ algorithm handles bounded deletions only when insertions and deletions are not interleaved, whereas in practical applications they usually are. The paper therefore proposes new algorithms that support interleaved insertions and deletions and provides a tighter error-bound analysis.

### Main Contributions

1. **Proposing New Algorithms**: The paper proposes three new algorithms: Double SpaceSaving$\pm$, Unbiased Double SpaceSaving$\pm$, and Integrated SpaceSaving$\pm$. These algorithms work when insertions and deletions are interleaved and provide strong theoretical guarantees while using limited space.
2. **Introducing Residual Error Bound**: The residual error-bound guarantee is introduced in the bounded-deletion model, and the proposed algorithms are proved to satisfy this stronger bound.
3. **Relative Error Guarantee**: Under the mild assumption that the data stream is skewed, the proposed algorithms are proved to also achieve a relative error guarantee.
4. **Mergeability**: A merge algorithm is provided, and all the proposed algorithms are proved to be mergeable in the bounded-deletion model, which is crucial for distributed applications.

### Formula Summary

- **Frequency Estimation Problem**:
  \[ \forall x\in U,\quad |f(x)-\hat{f}(x)|\leq\epsilon F_1 \]
  where $F_1 = \sum_{x\in U}f(x)$.
- **Heavy Hitters Problem**: Given a parameter $\epsilon>0$, heavy hitters are items with a frequency of at least $\epsilon F_1$.
- **Error Bound**:
  \[ |f(x)-\hat{f}(x)|\leq\frac{\epsilon}{m}F_1 \]
- **Relative Error Bound**:
  \[ |f_i-\hat{f}_i|\leq\epsilon f_i,\quad \forall i\leq k \]

Through these improvements, the paper overcomes the limitations of existing algorithms in handling interleaved insertions and deletions and provides a more efficient solution for processing large data streams.
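To make the problem definitions above concrete, here is a minimal Python sketch of an exact (unbounded-space) baseline. It is not any of the paper's algorithms. It only computes the quantities a space-bounded summary would be judged against: the insertion and deletion counts $I$ and $D$, the bounded-deletion condition $D\leq(1 - 1/\alpha)I$, the exact frequencies with $F_1$, the heavy hitters at level $\epsilon F_1$, and a check of the additive bound $|f(x)-\hat{f}(x)|\leq\epsilon F_1$. The function names, the `(item, +1/-1)` stream encoding, and the toy stream are all assumptions made for illustration.

```python
from collections import Counter


def exact_frequencies(stream):
    """Return (freq, inserts, deletes) for a stream of (item, +1 or -1) updates."""
    freq = Counter()
    inserts = deletes = 0
    for item, delta in stream:
        if delta == 1:
            inserts += 1
        elif delta == -1:
            deletes += 1
        else:
            raise ValueError("delta must be +1 (insert) or -1 (delete)")
        freq[item] += delta
        if freq[item] < 0:
            # No item may have a negative net frequency.
            raise ValueError(f"negative frequency for {item!r}")
    return freq, inserts, deletes


def satisfies_bounded_deletion(inserts, deletes, alpha):
    """Check the bounded-deletion condition D <= (1 - 1/alpha) * I."""
    return deletes <= (1 - 1 / alpha) * inserts


def heavy_hitters(freq, epsilon):
    """Items whose exact frequency is at least epsilon * F_1."""
    f1 = sum(freq.values())
    return {x for x, f in freq.items() if f > 0 and f >= epsilon * f1}


def within_additive_error(freq, estimate, epsilon):
    """Check |f(x) - estimate(x)| <= epsilon * F_1 for every item seen."""
    f1 = sum(freq.values())
    return all(abs(f - estimate(x)) <= epsilon * f1 for x, f in freq.items())


# Toy stream with interleaved insertions and deletions (illustrative only).
stream = [
    ("a", 1), ("b", 1), ("a", 1), ("a", -1), ("c", 1), ("a", 1),
    ("b", 1), ("c", 1), ("c", -1), ("a", 1), ("b", 1), ("a", 1),
]
freq, inserts, deletes = exact_frequencies(stream)
print(inserts, deletes, dict(freq))
print(satisfies_bounded_deletion(inserts, deletes, alpha=2))
print(heavy_hitters(freq, epsilon=0.3))
```

On this toy stream there are 10 insertions and 2 deletions, so $F_1 = 8$ and the stream satisfies the bounded-deletion condition for $\alpha = 2$. A space-bounded summary can then be evaluated by passing its query function as `estimate` to `within_additive_error`.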

The SpaceSaving$\pm$ Family of Algorithms for Data Streams with Bounded Deletions
