Abstract: Efficient learning from streaming data is important for modern data analysis due to the continuous and rapid evolution of data streams. Despite significant advancements in stream pattern mining, challenges persist, particularly in managing complex data streams like sequential and weighted itemsets. While reservoir sampling serves as a fundamental method for randomly selecting fixed-size samples from data streams, its application to such complex patterns remains largely unexplored. In this study, we introduce an approach that harnesses a weighted reservoir to facilitate direct pattern sampling from streaming batch data, thus ensuring scalability and efficiency. We present a generic algorithm capable of addressing temporal biases and handling various pattern types, including sequential, weighted, and unweighted itemsets. Through comprehensive experiments conducted on real-world datasets, we evaluate the effectiveness of our method, showcasing its ability to construct accurate incremental online classifiers for sequential data. Our approach not only enables previously unusable online machine learning models for sequential data to achieve accuracy comparable to offline baselines but also represents significant progress in the development of incremental online sequential itemset classifiers.

What problem does this paper attempt to address?

The main problem this paper addresses is that existing random sampling methods cope poorly with complex structured data streams, such as streams of sequential itemsets and weighted itemsets. Specifically:

1. **Limitations of existing methods**: Although reservoir sampling is a fundamental method for randomly selecting a fixed-size sample from a data stream, its ability to handle complex patterns (such as sequential and weighted itemsets) is limited. Traditional methods struggle with large-scale, rapidly changing structured data.
2. **Temporal bias and diversity of pattern types**: Existing methods handle temporal bias poorly and offer insufficient support for different pattern types (sequential, weighted, and unweighted itemsets).
3. **Accuracy of online classifiers**: When building online classifiers, existing methods struggle to classify sequential data accurately, especially when new labels emerge.

To solve these problems, the authors propose a new method, **RPS (Reservoir Patterns Sampler)**. It uses a weighted reservoir to sample patterns directly from streaming batch data, which ensures scalability and efficiency. The main contributions of RPS are:

- Proposing the first reservoir pattern sampling method suited to complex structured data (such as sequential and weighted itemsets).
- Designing a generic algorithm that handles temporal bias and combines multiple interestingness measures (such as frequency, area, and decay) with norm-based utility to avoid the long-tail problem.
- Showing experimentally that the sampled patterns can be used to build efficient online classifiers, especially for sequential data classification tasks with new labels.
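To make the idea of a time-biased weighted reservoir concrete, here is a minimal Python sketch. It is not the RPS algorithm from the paper: it uses the standard key-based weighted reservoir scheme of Efraimidis and Spirakis (A-Res), and only assumes that each batch arrives with a timestamp `t` and a precomputed non-negative weight. The function name `sample_batches`, the `(t, batch, weight)` stream layout, and the parameters `k`, `epsilon`, and `seed` are all hypothetical choices made for illustration.

```python
import heapq
import math
import random


def sample_batches(stream, k, epsilon=0.0, seed=None):
    """Keep up to k batches, favouring heavy and recent ones (A-Res style keys).

    `stream` yields (t, batch, weight) triples; this layout is an assumption
    of the sketch, not something prescribed by the paper.
    """
    rng = random.Random(seed)
    reservoir = []  # min-heap of (log_key, arrival_index, batch)
    for index, (t, batch, weight) in enumerate(stream):
        if weight <= 0:
            continue  # a batch with no weight can never be drawn
        # The A-Res key is u ** (1 / w) with w = weight * exp(epsilon * t).
        # Using log(key) = log(u) * exp(-epsilon * t) / weight instead
        # avoids overflow of exp(epsilon * t) on long streams.
        u = 1.0 - rng.random()  # lies in (0, 1], so log(u) is defined
        log_key = math.log(u) * math.exp(-epsilon * t) / weight
        entry = (log_key, index, batch)
        if len(reservoir) < k:
            heapq.heappush(reservoir, entry)
        elif log_key > reservoir[0][0]:
            # The new key beats the smallest stored key: swap it in.
            heapq.heapreplace(reservoir, entry)
    return [batch for _, _, batch in reservoir]
```

With `epsilon=0` the sketch reduces to plain weighted reservoir sampling; a positive `epsilon` makes later batches exponentially more likely to be retained. The arrival index in each heap entry only breaks ties, so batches themselves are never compared.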
Therefore, this paper aims to improve both the handling of complex structured data streams and the accuracy and efficiency of online classifiers by proposing a new sampling method.

### Formula summary

1. **Global Pattern Utility**:
   \[ G_m(\phi, D)=\sum_{(t, B)\in D}\left(\sum_{\gamma\in B}m(\phi, \gamma)\right) \]
2. **Pattern Global Utility under temporal bias**:
   \[ G^\epsilon_m(\phi, D)=\sum_{(t_i, B_i)\in D}\left(\sum_{\gamma_{ij}\in B_i}m(\phi, \gamma_{ij})\times\nabla_\epsilon(t_n, t_i)\right) \]
   where \(\nabla_\epsilon(t_n, t_i)=e^{-(t_n - t_i)\times\epsilon}\) is the time-decay function.
3. **Batch acceptance probability**:
   \[ p_j = \frac{\omega_m(B_j)\times e^{\epsilon\times t_j}}{Z_i} \]
   where \(Z_i = Z_{i - 1}+\omega_m(B_i)\times e^{\epsilon\times t_i}\) is the normalization constant.
4. **Cumulative Binomial Probability Distribution (CBPD)**:
   \[ P(n_r, k)=\sum_{i = n_r}^k\binom{k}{i}p^i(1 - p)^{k - i} \]
5. **Incomplete Beta Function**
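The time-decayed utility and the cumulative binomial sum listed above can be evaluated directly. The sketch below is only an illustration under stated assumptions: the interestingness measure \(m\) is taken to be plain frequency on unweighted itemsets (1 if the pattern is contained in the instance, 0 otherwise), \(t_n\) is taken to be the largest timestamp in the data, and the function names `frequency`, `decayed_global_utility`, and `cumulative_binomial` are hypothetical.

```python
import math


def frequency(pattern, instance):
    """Assumed measure m(phi, gamma): 1 if the pattern occurs in the instance."""
    return 1.0 if set(pattern) <= set(instance) else 0.0


def decayed_global_utility(pattern, data, epsilon=0.0, measure=frequency):
    """G^eps_m(phi, D) for data given as a list of (timestamp, batch) pairs."""
    if not data:
        return 0.0
    t_n = max(t for t, _ in data)  # assumed: t_n is the latest timestamp
    total = 0.0
    for t_i, batch in data:
        decay = math.exp(-(t_n - t_i) * epsilon)  # time-decay factor
        total += decay * sum(measure(pattern, gamma) for gamma in batch)
    return total


def cumulative_binomial(n_r, k, p):
    """P(n_r, k): sum of binomial terms for i = n_r, ..., k."""
    return sum(
        math.comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(n_r, k + 1)
    )
```

Setting `epsilon=0` makes every decay factor equal to 1, so `decayed_global_utility` then returns the undecayed utility \(G_m(\phi, D)\). A different measure, such as area, could be passed through the `measure` argument.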

RPS: A Generic Reservoir Patterns Sampler
