MFS-SubSC: an efficient algorithm for mining frequent sequences with sub-sequence constraint

Hai Duong, Anh Tran
DOI: https://doi.org/10.1007/s10115-024-02148-w
IF: 2.7
2024-06-13
Knowledge and Information Systems
Abstract: Mining frequent sequences (FS) with constraints in a sequence database (SDB) is a critical task in data mining, as it forms the basis for discovering meaningful patterns within sequential data. However, traditional algorithms that mine constrained FSs directly from the SDB often perform poorly, especially on large SDBs and at low support thresholds. Moreover, constraint-based sequence mining algorithms face additional challenges, such as increased runtime and memory usage, particularly when constraints change frequently. To address these issues, this paper introduces an efficient method for generating FSs that include a user-defined sub-sequence. Specifically, the discovered FSs must be super-sequences of the given sub-sequence. Rather than discovering these sequences directly from the SDB in the traditional manner, the proposed method quickly generates constrained FSs from frequent closed sequences (FCS) and frequent generator sequences (FGS). This process involves categorizing constrained FSs into equivalence classes, each represented by its FCSs and FGSs. An efficient method is then adapted to swiftly generate the constrained FSs within each class from these representative elements. Additionally, a novel technique called Constraint Satisfaction Technique (CST) is introduced to circumvent computationally expensive checks for the inclusion relation among sequences during the generation process. Furthermore, a novel algorithm named MFS-SubSC is developed based on the proposed theoretical results to generate all constrained FSs efficiently. Experimental results demonstrate that the proposed algorithm surpasses state-of-the-art methods in terms of runtime, memory usage, and scalability.
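To make the sub-sequence constraint concrete, the following is a minimal Python sketch of the naive baseline that the abstract contrasts against: checking the inclusion relation directly and filtering candidates against the SDB. It is not MFS-SubSC and does not implement CST, FCSs, or FGSs. The representation of a sequence as a list of itemsets and the names `make_seq`, `is_subsequence`, `support`, and `filter_by_constraint` are all assumptions made for illustration.

```python
from typing import FrozenSet, List

Itemset = FrozenSet[str]
Sequence_ = List[Itemset]  # a sequence is an ordered list of itemsets (assumed representation)


def make_seq(*itemsets: str) -> Sequence_:
    """Hypothetical helper: make_seq("a", "bc") builds <{a}, {b, c}>."""
    return [frozenset(s) for s in itemsets]


def is_subsequence(sub: Sequence_, seq: Sequence_) -> bool:
    """Inclusion relation: every itemset of `sub` is a subset of a distinct
    itemset of `seq`, with order preserved. Greedy earliest matching suffices."""
    i = 0
    for itemset in seq:
        if i < len(sub) and sub[i] <= itemset:
            i += 1
    return i == len(sub)


def support(pattern: Sequence_, sdb: List[Sequence_]) -> int:
    """Number of input sequences in the SDB that contain `pattern`."""
    return sum(1 for s in sdb if is_subsequence(pattern, s))


def filter_by_constraint(candidates: List[Sequence_], constraint: Sequence_,
                         sdb: List[Sequence_], min_sup: int) -> List[Sequence_]:
    """Naive baseline: keep candidates that are frequent AND are
    super-sequences of the user-defined sub-sequence `constraint`."""
    return [c for c in candidates
            if is_subsequence(constraint, c) and support(c, sdb) >= min_sup]


# Toy data, invented purely for illustration.
sdb = [make_seq("a", "bc", "d"), make_seq("a", "b", "d"), make_seq("b", "d")]
constraint = make_seq("b", "d")
candidates = [make_seq("a", "b", "d"), make_seq("a", "d"),
              make_seq("b", "d"), make_seq("a", "bc", "d")]
result = filter_by_constraint(candidates, constraint, sdb, min_sup=2)
```

In this toy run, `<{a},{d}>` is rejected because it does not contain the constraint, and `<{a},{b,c},{d}>` is rejected because its support is 1. Every candidate incurs an explicit inclusion check plus a scan of the SDB, which is the kind of cost the abstract says the proposed method is designed to avoid.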
computer science, information systems, artificial intelligence