Fuheng Zhao, Divyakant Agrawal, Amr El Abbadi, Claire Mathieu, Ahmed Metwally, Michel de Rougemont
Abstract: In this paper, we present an advanced analysis of near optimal algorithms that use limited space to solve the frequency estimation, heavy hitters, frequent items, and top-k approximation in the bounded deletion model. We define the family of SpaceSaving$\pm$ algorithms and explain why the original SpaceSaving$\pm$ algorithm only works when insertions and deletions are not interleaved. Next, we propose the new Double SpaceSaving$\pm$, Unbiased Double SpaceSaving$\pm$, and Integrated SpaceSaving$\pm$ and prove their correctness. The three proposed algorithms represent different trade-offs, in which Double SpaceSaving$\pm$ can be extended to provide unbiased estimations while Integrated SpaceSaving$\pm$ uses less space. Since data streams are often skewed, we present an improved analysis of these algorithms and show that errors do not depend on the hot items. We also demonstrate how to achieve relative error guarantees under mild assumptions. Moreover, we establish that the important mergeability property is satisfied by all three algorithms, which is essential for running the algorithms in distributed settings.
What problem does this paper attempt to address?
The paper addresses how to efficiently solve the frequency estimation, heavy hitters, frequent items, and top-k approximation problems on data streams in the bounded-deletion model. Specifically:
1. **Frequency Estimation Problem**: Given a data stream $\sigma$ from a universe $U$, the frequency estimation problem aims to construct a summary that provides an estimate $\hat{f}(x)$ of the true frequency $f(x)$ for each item $x\in U$.
2. **Heavy Hitters Problem**: Given a frequency threshold $T$, the goal is to identify all items with a frequency greater than $T$. This is very useful when data scientists and system operators need to know which items occur very frequently.
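The two problem statements above can be made concrete with an exact (unbounded-space) baseline, which is what a small-space summary tries to approximate. The sketch below is only illustrative: the `(item, delta)` stream encoding and the function names `exact_frequencies` and `heavy_hitters` are assumptions made for this example, not notation from the paper.

```python
from collections import Counter

def exact_frequencies(stream):
    """Return the exact net frequency f(x) of every item.

    Assumed encoding: each update is a pair (item, delta),
    with delta = +1 for an insertion and -1 for a deletion.
    """
    freq = Counter()
    for item, delta in stream:
        freq[item] += delta
    return freq

def heavy_hitters(freq, threshold):
    """Return all items whose frequency is strictly greater than the threshold T."""
    return {x for x, count in freq.items() if count > threshold}

# Toy stream: "a" is inserted three times, "b" is inserted then deleted.
stream = [("a", +1), ("b", +1), ("a", +1), ("c", +1), ("a", +1), ("b", -1)]
freq = exact_frequencies(stream)
hitters = heavy_hitters(freq, threshold=2)
```

This baseline stores one counter per distinct item, which is exactly the cost that limited-space summaries are designed to avoid.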
### Research Background and Challenges
- **Insert-only Model**: Every arriving stream item is an insertion, and the frequency $f(x)$ is the number of times $x$ has been inserted.
- **Turnstile Model**: Arriving stream items can be insertions or deletions. The net frequency $f(x)$ is the number of times $x$ has been inserted minus the number of times it has been deleted, and no item is allowed to have a negative frequency.
- **Bounded-deletion Model**: The number of deletions $D$ is assumed to be at most a fixed fraction of the number of insertions $I$, that is, $D\leq(1 - 1/\alpha)I$ for some $\alpha\geq 1$.
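A direct consequence of the bounded-deletion condition is that $F_1 = I - D \geq I/\alpha$, so the net stream size cannot shrink arbitrarily relative to the number of insertions. As a minimal sketch, assuming the same hypothetical `(item, delta)` encoding with `delta` equal to `+1` or `-1`, the condition can be checked as follows (the helper names are illustrative only):

```python
def count_updates(stream):
    """Return (I, D): the number of insertions and deletions in the stream."""
    inserts = sum(1 for _, delta in stream if delta > 0)
    deletes = sum(1 for _, delta in stream if delta < 0)
    return inserts, deletes

def is_bounded_deletion(stream, alpha):
    """Check D <= (1 - 1/alpha) * I, rewritten as alpha * D <= (alpha - 1) * I
    to avoid a division."""
    inserts, deletes = count_updates(stream)
    return alpha * deletes <= (alpha - 1) * inserts

# Toy stream with I = 4 insertions and D = 2 deletions.
stream = [("a", +1), ("b", +1), ("a", -1), ("c", +1), ("a", +1), ("b", -1)]
I, D = count_updates(stream)
```

With `alpha = 1` the condition forces `D = 0`, which recovers the insert-only model; larger values of `alpha` admit proportionally more deletions.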
The existing SpaceSaving$\pm$ algorithm handles bounded deletions only when insertions and deletions are not interleaved, whereas in practical applications they usually are. The paper therefore proposes new algorithms that support interleaved insertions and deletions and provides a tighter error-bound analysis.
### Main Contributions
1. **Proposing New Algorithms**: The paper proposes three new algorithms: Double SpaceSaving$\pm$, Unbiased Double SpaceSaving$\pm$, and Integrated SpaceSaving$\pm$. All three handle interleaved insertions and deletions and come with strong theoretical guarantees. They represent different trade-offs: Double SpaceSaving$\pm$ can be extended to give unbiased estimates, while Integrated SpaceSaving$\pm$ uses less space.
2. **Introducing Residual Error Bound**: A residual error bound is introduced for the bounded-deletion model, and the proposed algorithms are proved to satisfy this stronger bound, in which the error does not depend on the hot items.
3. **Relative Error Guarantee**: Under mild assumptions on the skew of the data stream, the proposed algorithms are proved to also achieve a relative error guarantee.
4. **Mergeability**: A merge algorithm is provided, and all the proposed algorithms are proved to be mergeable in the bounded-deletion model, which is crucial for distributed applications.
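To give a feel for how interleaved deletions can be handled in limited space, the following sketch tracks insertions and deletions in two separate classic SpaceSaving summaries and answers queries with the difference of the two estimates. This is a simplified illustration under stated assumptions, not the paper's exact algorithm: that the "Double" variant keeps two independent summaries is inferred from its name, and the class names, the counter sizes `m_insert` and `m_delete`, and the rule of returning 0 for unmonitored items are all choices made for this example.

```python
import random
from collections import Counter

class SpaceSaving:
    """Classic insertion-only SpaceSaving summary with at most m counters."""

    def __init__(self, m):
        self.m = m
        self.counts = {}

    def insert(self, x):
        if x in self.counts:
            self.counts[x] += 1
        elif len(self.counts) < self.m:
            self.counts[x] = 1
        else:
            # Evict an item with the minimum count; x inherits that count plus one.
            victim = min(self.counts, key=self.counts.get)
            self.counts[x] = self.counts.pop(victim) + 1

    def estimate(self, x):
        # Assumption of this sketch: unmonitored items are estimated as 0.
        return self.counts.get(x, 0)

class DoubleSummary:
    """Hypothetical two-summary sketch: one summary per update type."""

    def __init__(self, m_insert, m_delete):
        self.ins = SpaceSaving(m_insert)
        self.dels = SpaceSaving(m_delete)

    def update(self, x, delta):
        (self.ins if delta > 0 else self.dels).insert(x)

    def estimate(self, x):
        return self.ins.estimate(x) - self.dels.estimate(x)

# Build an interleaved stream in which only currently present items are deleted.
rng = random.Random(0)
true_freq, summary = Counter(), DoubleSummary(m_insert=8, m_delete=8)
num_inserts = num_deletes = 0
for _ in range(500):
    present = [x for x, c in true_freq.items() if c > 0]
    if present and rng.random() < 0.3:
        x, delta = rng.choice(present), -1
        num_deletes += 1
    else:
        x, delta = rng.randrange(20), +1
        num_inserts += 1
    true_freq[x] += delta
    summary.update(x, delta)

# Each summary errs by at most (its stream length) / (its number of counters).
error_bound = num_inserts / 8 + num_deletes / 8
max_error = max(abs(true_freq[x] - summary.estimate(x)) for x in range(20))
```

In this sketch the error of each summary is bounded by its stream length divided by its number of counters, so the combined error is at most $I/m_I + D/m_D$. Because the bounded-deletion model limits both $I$ and $D$ to a constant multiple of $F_1$, a number of counters proportional to $\alpha/\epsilon$ suffices for an $\epsilon F_1$ guarantee.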
### Formula Summary
- **Frequency Estimation Problem**:
\[
\forall x\in U, |f(x)-\hat{f}(x)|\leq\epsilon F_1
\]
where $F_1 = \sum_{x\in U}f(x)$.
- **Heavy Hitters Problem**:
Given a parameter $\epsilon>0$, heavy hitters are items with a frequency of at least $\epsilon F_1$.
- **Error Bound**:
\[
|f(x)-\hat{f}(x)|\leq\frac{\epsilon}{m}F_1
\]
- **Relative Error Bound**:
\[
|f_i-\hat{f}_i|\leq\epsilon f_i, \forall i\leq k
\]
where $f_i$ is the frequency of the $i$-th most frequent item.
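The difference between the additive guarantee $\epsilon F_1$ and the relative guarantee $\epsilon f_i$ is easiest to see on a small example. The checker functions and the numbers below are invented for illustration; `f` is assumed to list every item of the universe under consideration, and items missing from `f_hat` are treated as estimated at 0.

```python
def additive_error_ok(f, f_hat, eps):
    """Check |f(x) - f_hat(x)| <= eps * F_1 for every item x in f."""
    F1 = sum(f.values())
    return all(abs(f[x] - f_hat.get(x, 0)) <= eps * F1 for x in f)

def relative_error_ok(f, f_hat, eps, k):
    """Check |f_i - f_hat_i| <= eps * f_i for the k most frequent items."""
    top_k = sorted(f, key=f.get, reverse=True)[:k]
    return all(abs(f[x] - f_hat.get(x, 0)) <= eps * f[x] for x in top_k)

# Made-up true frequencies and estimates on a skewed toy distribution.
f = {"a": 100, "b": 50, "c": 4, "d": 1}
f_hat = {"a": 103, "b": 48, "c": 7, "d": 0}

additive = additive_error_ok(f, f_hat, eps=0.05)
relative_top2 = relative_error_ok(f, f_hat, eps=0.05, k=2)
relative_top3 = relative_error_ok(f, f_hat, eps=0.05, k=3)
```

Here every estimate is within $\epsilon F_1 = 7.75$ of the truth, and the two hottest items also meet the relative bound, but the third item does not: an absolute error of 3 is small compared with $F_1$ yet large compared with its own frequency of 4. This is why the relative guarantee is stated only for the top $k$ items.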
Through these improvements, the paper overcomes the limitation of existing algorithms in handling interleaved insertions and deletions and provides a more efficient solution for large-scale data stream processing.