Mining Productive Itemsets in Dynamic Databases

Xiang Li, Jiaxuan Li, Philippe Fournier-Viger, M. Saqib Nawaz, Jie Yao, Jerry Chun-Wei Lin
DOI: https://doi.org/10.1109/access.2020.3012817
IF: 3.9
2020-01-01
IEEE Access
Abstract: Discovering frequent itemsets is a data analysis task used in numerous domains. It consists of finding sets of items (itemsets) that frequently appear in a set of database records (also called transactions). Though discovering frequent itemsets is useful, it can produce a large number of spurious patterns. As a result, the user may spend a great deal of time analyzing the itemsets found by a frequent itemset mining algorithm to find truly interesting patterns. Hence, in recent years, a key research topic has emerged: discovering statistically significant patterns in databases. The most popular model for identifying statistically significant itemsets is to discover non-redundant productive itemsets. The state-of-the-art algorithm for extracting this set of patterns is OPUS-Miner. A key drawback of that algorithm is that it is designed for static databases. A second drawback is that OPUS-Miner discovers all patterns in a database; in other words, the user cannot search for itemsets containing specific items. This paper addresses these issues by defining the novel problem of discovering targeted non-redundant productive itemsets in dynamic databases. An algorithm named IDPI+ (Interactive Discovery of Productive Itemsets) is presented. It stores transactions in a tree structure, which can then be interactively queried to identify productive and non-redundant itemsets containing specific items. A structure named Query-Tree is also introduced to process many queries at the same time. Moreover, to handle dynamic databases, efficient transaction insertion and deletion algorithms are provided to update the tree. An experimental evaluation on benchmark datasets containing various types of data shows that IDPI+ can handle thousands of queries per second on a desktop computer and that it is more than an order of magnitude faster than a baseline algorithm.
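To make the notion of a targeted query concrete, the following minimal sketch may help. It is not IDPI+ and uses no tree structure: it is a brute-force illustration that assumes the commonly used definition of productivity, under which an itemset is productive if its support exceeds the support expected under independence for every split into two non-empty parts. The statistical significance test and the non-redundancy check are omitted. The toy data and the names `count`, `is_productive`, and `query` are hypothetical.

```python
from itertools import combinations

# Toy transaction database (hypothetical data, for illustration only).
TRANSACTIONS = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
    {"d"},
]


def count(itemset, db):
    """Number of transactions that contain every item of the itemset."""
    wanted = frozenset(itemset)
    return sum(1 for t in db if wanted <= t)


def is_productive(itemset, db):
    """Assumed definition: count(X) / n > (count(Y) / n) * (count(Z) / n)
    for every split of X into non-empty parts Y and Z.
    Integer arithmetic is used to avoid floating-point ties."""
    items = sorted(itemset)
    if len(items) < 2:
        return False  # this sketch only considers itemsets of 2+ items
    n = len(db)
    whole = count(items, db)
    first, rest = items[0], items[1:]
    # Keeping `first` on the left enumerates each split exactly once;
    # r < len(rest) guarantees the right part is never empty.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = {first, *extra}
            right = set(items) - left
            if whole * n <= count(left, db) * count(right, db):
                return False
    return True


def query(required, db, max_size=3):
    """Brute-force targeted query: productive itemsets that contain
    all `required` items and have at most `max_size` items."""
    required = frozenset(required)
    others = sorted(set().union(*db) - required)
    found = []
    for r in range(max_size - len(required) + 1):
        for extra in combinations(others, r):
            candidate = required | frozenset(extra)
            if is_productive(candidate, db):
                found.append(candidate)
    return found


if __name__ == "__main__":
    print(query({"a"}, TRANSACTIONS))
```

On this toy data, {a, b} passes the check (3 of 6 transactions, against an expectation of 4/6 × 4/6), whereas {a, b, c} does not, because its support of 2/6 merely equals the product of the supports of {a} and {b, c}. Each query rescans the whole database, which is the kind of repeated cost an index over the transactions is meant to avoid.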