Discovering Useful Compact Sets of Sequential Rules in a Long Sequence

Erwan Bourrand, Luis Galárraga, Esther Galbrun, Elisa Fromont, Alexandre Termier
DOI: https://doi.org/10.48550/arXiv.2109.07519
2022-12-30
Abstract: We are interested in understanding the underlying generation process for long sequences of symbolic events. To do so, we propose COSSU, an algorithm to mine small and meaningful sets of sequential rules. The rules are selected using an MDL-inspired criterion that favors compactness and relies on a novel rule-based encoding scheme for sequences. Our evaluation shows that COSSU can successfully retrieve relevant sets of closed sequential rules from a long sequence. Such rules constitute an interpretable model that exhibits competitive accuracy for the tasks of next-element prediction and classification.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The problem that this paper attempts to solve is how to discover useful, compact sets of sequential rules from a long sequence. Specifically, the paper focuses on understanding the generation process of long sequences of symbolic events and proposes an algorithm named COSSU to mine small and meaningful sets of sequential rules. The rules are selected using an MDL-inspired (Minimum Description Length) criterion that relies on a rule-based encoding scheme for sequences and favors compact rule sets.

### Specific Background of the Problem

1. **Importance of Long-Sequence Data**
   - Long-sequence data is common in many fields, such as DNA sequences, server logs, network packet traces, and long texts.
   - Discovering the patterns in these sequences helps to better understand the sequence generation process and can be used for diagnosis and prediction.
2. **Limitations of Existing Methods**
   - Most existing sequential rule mining methods assume that the input is a database of short sequences rather than a single long sequence. Splitting a long sequence into short sequences loses information at the boundaries between the pieces.
   - Sequential rule mining faces the so-called "pattern explosion" problem: because the search space is combinatorial, millions of sequential rules may be generated.

### Goals of the Paper

The paper proposes the first sequential rule mining method that can directly process long sequences and output compact rule sets. Specific goals include:

- **Compactness and Interpretability**: Select a compact rule set through the MDL criterion, so that the rule set both compresses the data and remains highly interpretable.
- **Predictive Ability**: Verify the effectiveness of the discovered rules in next-element prediction and classification tasks.

### Solution

The COSSU algorithm proposed in the paper mainly includes the following steps:

1. **Rule Construction**: Extract closed frequent subsequences from the input sequence and generate candidate rules.
2. **Rule Selection**: Use a greedy strategy to gradually add rules to the rule set while adjusting the rule weights to optimize the encoding length.
3. **Rule Evaluation**: Verify the effectiveness of the discovered rules through experiments, especially their performance on synthetic and real-world data.

### Experimental Results

The paper verifies the effectiveness of the COSSU algorithm through a series of experiments, including:

- **Synthetic Data Experiments**: Study the performance of the algorithm under different insertion probabilities, alphabet sizes, sequence lengths, and rule sizes.
- **Prediction Task Experiments**: Test the performance of COSSU on the next-event prediction task using real-life log data, and compare it with baseline methods and other models (such as Bigram and HMM).

In general, the paper aims to address the challenges of sequential rule mining in long sequences and to provide a solution that compresses the data effectively and is highly interpretable.
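
The greedy, MDL-guided selection step summarized above can be illustrated with a small Python sketch. This is a toy under stated assumptions, not COSSU's actual procedure: candidates are contiguous n-gram rules rather than rules built from closed frequent subsequences, rule weights are omitted, and the code length is a simple hypothetical stand-in for the paper's encoding scheme. The names `candidate_rules`, `description_length`, and `greedy_select` are invented for this example.

```python
import math


def candidate_rules(seq, max_len=2, min_support=2):
    """Toy candidate generation: contiguous antecedent -> next symbol.

    Assumption: the paper builds candidates from closed frequent
    subsequences; contiguous n-grams are used here only for brevity.
    """
    counts = {}
    for n in range(1, max_len + 1):
        for i in range(len(seq) - n):
            rule = (tuple(seq[i:i + n]), seq[i + n])
            counts[rule] = counts.get(rule, 0) + 1
    return [rule for rule, count in counts.items() if count >= min_support]


def description_length(seq, rules, alphabet_size):
    """Hypothetical two-part code length in bits: model cost + data cost."""
    sym_bits = math.log2(alphabet_size)
    # Model cost: spell out every symbol of every selected rule.
    model_bits = sum((len(ante) + 1) * sym_bits for ante, _ in rules)
    data_bits = 0.0
    for i, sym in enumerate(seq):
        # Consequents of all rules whose antecedent ends right before i.
        predicted = {cons for ante, cons in rules
                     if len(ante) <= i and tuple(seq[i - len(ante):i]) == ante}
        if not predicted:
            data_bits += sym_bits                        # raw symbol
        elif sym in predicted:
            data_bits += 1 + math.log2(len(predicted))   # hit flag + choice
        else:
            data_bits += 1 + sym_bits                    # miss flag + raw symbol
    return model_bits + data_bits


def greedy_select(seq, candidates, alphabet_size):
    """Add the rule that shortens the total code most; stop when none helps."""
    selected = []
    best = description_length(seq, selected, alphabet_size)
    remaining = list(candidates)
    while remaining:
        scored = [(description_length(seq, selected + [r], alphabet_size), r)
                  for r in remaining]
        new_len, rule = min(scored, key=lambda pair: pair[0])
        if new_len >= best:
            break  # every remaining rule costs more than it saves
        selected.append(rule)
        remaining.remove(rule)
        best = new_len
    return selected, best


# Usage on a perfectly periodic toy sequence over the alphabet {a, b, c}.
seq = list("abc" * 20)
rules, bits = greedy_select(seq, candidate_rules(seq), alphabet_size=3)
print(sorted(rules), round(bits, 1))
```

On this toy input the loop keeps only the three single-symbol rules (a to b, b to c, c to a). The longer candidates predict the same symbols at the same positions, so they add model cost without shortening the data part, which is the compactness pressure an MDL-style criterion is meant to exert.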