Abstract: We are interested in understanding the underlying generation process for long sequences of symbolic events. To do so, we propose COSSU, an algorithm to mine small and meaningful sets of sequential rules. The rules are selected using an MDL-inspired criterion that favors compactness and relies on a novel rule-based encoding scheme for sequences. Our evaluation shows that COSSU can successfully retrieve relevant sets of closed sequential rules from a long sequence. Such rules constitute an interpretable model that exhibits competitive accuracy for the tasks of next-element prediction and classification.

What problem does this paper attempt to address?

The problem this paper attempts to solve is how to discover useful and compact sets of sequential rules from long sequences. Specifically, the paper focuses on understanding the generation process of long symbolic event sequences and proposes an algorithm named COSSU to mine small and meaningful sets of sequential rules. These rules are selected through an encoding scheme based on the MDL (Minimum Description Length) criterion to ensure the compactness of the rule set.

### Specific Background of the Problem

1. **Importance of Long-Sequence Data**
   - Long-sequence data is common in many fields, such as DNA sequences, server logs, network packet traces, and long texts.
   - Discovering the patterns in these sequences helps to better understand the sequence generation process and can be used for diagnosis and prediction.
2. **Limitations of Existing Methods**
   - Most existing sequential rule mining methods assume that the input is a database of short sequences rather than one long sequence. Splitting a long sequence into short sequences loses information at the split boundaries.
   - Sequential rule mining faces the so-called "pattern explosion" problem: because the search space is combinatorial, millions of sequential rules may be generated.

### Goals of the Paper

The paper proposes the first sequential rule mining method that can directly process long sequences and output compact rule sets. Specific goals include:

- **Compactness and Interpretability**: Select a compact rule set through the MDL criterion, so that the rule set not only compresses the data but is also highly interpretable.
- **Predictive Ability**: Verify the effectiveness of the discovered rules in next-element prediction and classification tasks.

### Solution

The COSSU algorithm proposed in the paper mainly includes the following steps:

1. **Rule Construction**: Extract closed frequent subsequences from the input sequence and generate candidate rules.
2. **Rule Selection**: Use a greedy strategy to gradually add rules to the rule set while adjusting the rule weights to optimize the encoding length.
3. **Rule Evaluation**: Verify the effectiveness of the discovered rules through experiments, especially their performance on synthetic data and real-world data.

### Experimental Results

The paper verifies the effectiveness of the COSSU algorithm through a series of experiments, including:

- **Synthetic Data Experiments**: Study the performance of the algorithm under different insertion probabilities, alphabet sizes, sequence lengths, and rule sizes.
- **Prediction Task Experiments**: Test the performance of COSSU in the next-event prediction task on real-life log data and compare it with baseline methods and other models (such as Bigram and HMM).

In general, the paper aims to solve the challenges of sequential rule mining in long sequences and to provide a solution that effectively compresses the data and is highly interpretable.
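To make the greedy, MDL-guided rule selection step more concrete, here is a minimal Python sketch. It is an illustration under simplifying assumptions, not COSSU's actual encoding: candidates are contiguous antecedent/consequent pairs rather than closed frequent subsequences, rule weights are not modeled, and the code length is a toy two-part score (bits for the rules plus bits for the sequence given the rules). The function names `candidate_rules`, `total_cost`, and `greedy_select` are hypothetical.

```python
import math
from collections import Counter

def candidate_rules(seq, max_len=2, min_support=2):
    """Hypothetical candidate generator: contiguous (antecedent, consequent) pairs."""
    counts = Counter()
    for i in range(len(seq)):
        for la in range(1, max_len + 1):
            for lc in range(1, max_len + 1):
                if i + la + lc <= len(seq):
                    counts[(seq[i:i + la], seq[i + la:i + la + lc])] += 1
    return [rule for rule, n in counts.items() if n >= min_support]

def total_cost(seq, rules):
    """Toy two-part code length in bits: L(rules) + L(seq | rules).

    Assumes seq is a non-empty tuple of symbols.
    """
    sym_bits = math.log2(len(set(seq)))  # cost of one literal symbol
    # Model cost: spell out every symbol of each rule, plus one separator.
    bits = sum((len(a) + len(c) + 1) * sym_bits for a, c in rules)
    i = 0
    while i < len(seq):
        # Rules whose antecedent matches the already-encoded prefix. A decoder
        # can recompute this set, so signalling it costs nothing.
        applicable = [(a, c) for a, c in rules
                      if i >= len(a) and seq[i - len(a):i] == a]
        if not applicable:
            bits += sym_bits  # plain literal symbol
            i += 1
            continue
        firing = [c for a, c in applicable if seq[i:i + len(c)] == c]
        if firing:
            # 1 flag bit + index of the chosen rule among the applicable ones.
            bits += 1 + math.log2(len(applicable))
            i += len(max(firing, key=len))
        else:
            bits += 1 + sym_bits  # flag bit + literal symbol
            i += 1
    return bits

def greedy_select(seq, candidates):
    """Add the rule that shortens the total code most; stop when none helps."""
    selected, best = [], total_cost(seq, [])
    while True:
        trials = [(total_cost(seq, selected + [r]), r)
                  for r in candidates if r not in selected]
        if not trials:
            break
        cost, rule = min(trials, key=lambda t: t[0])
        if cost >= best:
            break
        selected.append(rule)
        best = cost
    return selected, best

seq = tuple("abc" * 20)
rules, bits = greedy_select(seq, candidate_rules(seq))
print(rules, round(bits, 1), "vs. no rules:", round(total_cost(seq, []), 1))
```

On a strongly periodic sequence like this one, a handful of rules is enough to bring the total code length below the rule-free baseline, which is the behavior an MDL criterion rewards: a rule is kept only if the bits it saves on the data exceed the bits needed to describe it.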

Discovering Useful Compact Sets of Sequential Rules in a Long Sequence

Efficient Mining Of Recurrent Rules From A Sequence Database

Online semi-supervised learning of composite event rules by combining structure and mass-based predicate similarity

MCoR-Miner: Maximal Co-Occurrence Nonoverlapping Sequential Rule Mining

A Compact DAG for Storing and Searching Maximal Common Subsequences

Anomaly Rule Detection in Sequence Data

Specialized Mathematical Solving by a Step-By-Step Expression Chain Generation

TSRuleGrowth : Extraction de règles de prédiction semi-ordonnées à partir d'une série temporelle d'éléments discrets, application dans un contexte d'intelligence ambiante

Hiérarchisation des règles d'association en fouille de textes

Learning Classifier System Ensemble and Compact Rule Set

Closed-set-based Discovery of Bases of Association Rules

Lissard: Long and Simple Sequential Reasoning Datasets

Scalable Rule Lists Learning with Sampling

Ensemble learning classifier system and compact ruleset

How to Mine Information from Each Instance to Extract an Abbreviated and Credible Logical Rule

MLissard: Multilingual Long and Simple Sequential Reasoning Benchmarks

Motif-based Rule Discovery for Predicting Real-valued Time Series

Beyond Markov Logic: Efficient Mining of Prediction Rules in Large Graphs

Efficiently Learning Probabilistic Logical Models by Cheaply Ranking Mined Rules

Efficient Exploration of the Rashomon Set of Rule Set Models

Bridging Code Semantic and LLMs: Semantic Chain-of-Thought Prompting for Code Generation