RPS: A Generic Reservoir Patterns Sampler

Lamine Diop, Marc Plantevit, Arnaud Soulet
2024-11-01
Abstract: Efficient learning from streaming data is important for modern data analysis due to the continuous and rapid evolution of data streams. Despite significant advancements in stream pattern mining, challenges persist, particularly in managing complex data streams like sequential and weighted itemsets. While reservoir sampling serves as a fundamental method for randomly selecting fixed-size samples from data streams, its application to such complex patterns remains largely unexplored. In this study, we introduce an approach that harnesses a weighted reservoir to facilitate direct pattern sampling from streaming batch data, thus ensuring scalability and efficiency. We present a generic algorithm capable of addressing temporal biases and handling various pattern types, including sequential, weighted, and unweighted itemsets. Through comprehensive experiments conducted on real-world datasets, we evaluate the effectiveness of our method, showcasing its ability to construct accurate incremental online classifiers for sequential data. Our approach not only enables previously unusable online machine learning models for sequential data to achieve accuracy comparable to offline baselines but also represents significant progress in the development of incremental online sequential itemset classifiers.
Machine Learning, Artificial Intelligence, Combinatorics, Probability
What problem does this paper attempt to address?
This paper addresses the difficulty of sampling patterns from complex structured data streams, such as sequential itemsets and weighted itemsets, which existing random sampling methods handle poorly. Specifically:

1. **Limitations of existing methods**: Reservoir sampling is a fundamental method for randomly selecting a fixed-size sample from a data stream, but its application to complex patterns (such as sequential and weighted itemsets) remains largely unexplored. Traditional methods struggle with large-scale, rapidly changing structured data.
2. **Temporal bias and diversity of pattern types**: Existing methods do not handle temporal biases effectively and offer insufficient support for different pattern types (sequential, weighted, and unweighted itemsets).
3. **Accuracy of online classifiers**: When building online classifiers, existing methods have difficulty classifying sequential data accurately, especially when new labels emerge.

To solve these problems, the authors propose **RPS (Reservoir Patterns Sampler)**. It uses a weighted reservoir to sample patterns directly from streaming batch data, which keeps the method scalable and efficient. The main contributions of RPS are:

- Proposing the first reservoir pattern sampling method suitable for complex structured data (such as sequential and weighted itemsets).
- Designing a generic algorithm that handles temporal biases and combines multiple interestingness measures (such as frequency, area, and decay) with a norm-based utility to avoid the long-tail problem.
- Showing experimentally that the sampled patterns can be used to build efficient online classifiers, especially for sequential data classification tasks in which new labels appear.

Therefore, this paper aims to improve the processing of complex structured data streams, and the accuracy and efficiency of online classifiers, by proposing a new sampling method.

### Formula summary

1. **Global Pattern Utility**:
\[ G_m(\phi, D)=\sum_{(t, B)\in D}\left(\sum_{\gamma\in B}m(\phi, \gamma)\right) \]
2. **Pattern Global Utility under temporal bias**:
\[ G^\epsilon_m(\phi, D)=\sum_{(t_i, B_i)\in D}\left(\sum_{\gamma_{ij}\in B_i}m(\phi, \gamma_{ij})\times\nabla_\epsilon(t_n, t_i)\right) \]
where \(\nabla_\epsilon(t_n, t_i)=e^{-(t_n - t_i)\times\epsilon}\) is the time-decay function.
3. **Batch acceptance probability**:
\[ p_j = \frac{\omega_m(B_j)\times e^{\epsilon\times t_j}}{Z_i} \]
where \(Z_i = Z_{i - 1}+\omega_m(B_i)\times e^{\epsilon\times t_i}\) is the normalization constant.
4. **Cumulative Binomial Probability Distribution (CBPD)**:
\[ P(n_r, k)=\sum_{i = n_r}^k\binom{k}{i}p^i(1 - p)^{k - i} \]
5. **Incomplete Beta Function**
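
The first four formulas in the summary can be evaluated directly, which may help when checking them by hand. The following is a minimal Python sketch, not the authors' implementation. It assumes a stream represented as a list of `(timestamp, batch)` pairs, itemsets represented as `frozenset`s, and a hypothetical `frequency` measure standing in for \(m\). The batch weights \(\omega_m(B_j)\) are taken as given inputs, since how they are computed is not stated here. All function names are illustrative.

```python
import math

def frequency(pattern, instance):
    """Hypothetical measure m: 1 if the pattern occurs in the instance, else 0."""
    return 1.0 if pattern <= instance else 0.0

def decay(t_now, t_i, eps):
    """Time-decay function: exp(-(t_now - t_i) * eps)."""
    return math.exp(-(t_now - t_i) * eps)

def global_utility(pattern, stream, m, eps=0.0):
    """Global pattern utility; eps = 0 gives the version without temporal bias.

    `stream` is a list of (timestamp, batch) pairs ordered by time, and the
    last timestamp plays the role of t_n.
    """
    t_now = stream[-1][0]
    return sum(
        m(pattern, gamma) * decay(t_now, t_i, eps)
        for t_i, batch in stream
        for gamma in batch
    )

def batch_acceptance_probabilities(batch_weights, timestamps, eps):
    """p_j = w_j * exp(eps * t_j) / Z, with Z accumulated batch by batch."""
    z = 0.0
    scaled = []
    for w, t in zip(batch_weights, timestamps):
        s = w * math.exp(eps * t)
        z += s  # Z_i = Z_{i-1} + w_i * exp(eps * t_i)
        scaled.append(s)
    return [s / z for s in scaled]

def cbpd(n_r, k, p):
    """Cumulative binomial probability of at least n_r successes in k trials."""
    return sum(
        math.comb(k, i) * p**i * (1 - p) ** (k - i)
        for i in range(n_r, k + 1)
    )

# Toy stream: two batches of itemsets at timestamps 0 and 1.
stream = [
    (0, [frozenset("ab"), frozenset("a")]),
    (1, [frozenset("abc")]),
]
phi = frozenset("a")
print(global_utility(phi, stream, frequency))                   # no temporal bias
print(global_utility(phi, stream, frequency, eps=math.log(2)))  # older batch is down-weighted
print(batch_acceptance_probabilities([2.0, 1.0], [0, 1], eps=math.log(2)))
print(cbpd(1, 2, 0.5))
```

Note that scaling each batch weight by \(e^{\epsilon t_j}\) gives the same ratios between batches as applying the decay \(e^{-(t_n - t_j)\epsilon}\), because the common factor \(e^{-\epsilon t_n}\) cancels in the normalization. This is what lets the constant be updated incrementally without rescaling the weights of earlier batches.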