Abstract: Frequent itemset mining is a popular and important first step in the analysis of data arising in a broad range of applications. The traditional “exact” model for frequent itemsets requires that every item occur in each supporting transaction. However, real data is typically subject to noise and measurement error. To date, the effect of noise on exact frequent pattern mining algorithms has been addressed primarily through simulation studies, and there has been limited attention to the development of noise-tolerant algorithms. In this paper we propose a noise-tolerant itemset model, which we call approximate frequent itemsets (AFI). Like frequent itemsets, the AFI model requires that an itemset have a minimum number of supporting transactions. However, the AFI model tolerates a controlled fraction of errors in each item and each supporting transaction. Motivating this model are theoretical results (and a supporting simulation study presented here) which state that, in the presence of even low levels of noise, large frequent itemsets are broken into fragments of logarithmic size; thus the itemsets cannot be recovered by a routine application of frequent itemset mining. By contrast, we provide theoretical results showing that the AFI criterion is well suited to the recovery of block structures subject to noise. We developed and implemented an algorithm to mine AFIs that generalizes the level-wise enumeration of frequent itemsets by allowing noise. We propose the noise-tolerant support threshold, a relaxed version of support, which varies with the length of the itemset and the noise threshold. We exhibit an Apriori property that permits the pruning of an itemset if any of its sub-itemsets is not sufficiently supported. Several experiments demonstrate that the AFI algorithm enables better recoverability of frequent patterns under noisy conditions than existing frequent itemset mining approaches. Noise-tolerant support pruning also yields an order-of-magnitude performance gain over existing methods.
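
To make the AFI criterion concrete, the following is a minimal sketch of a membership check, written under stated assumptions rather than taken from the paper. The abstract says only that the model tolerates a controlled fraction of errors in each item and in each supporting transaction and requires a minimum number of supporting transactions. The function name `is_afi` and the parameters `eps_row`, `eps_col`, and `min_sup` are hypothetical, as is the choice to compare the error fraction with a non-strict inequality.

```python
from typing import Iterable, Sequence, Set


def is_afi(
    transactions: Sequence[Set[str]],
    itemset: Iterable[str],
    support_rows: Iterable[int],
    eps_row: float,
    eps_col: float,
    min_sup: int,
) -> bool:
    """Check a candidate (itemset, supporting transactions) pair.

    Assumed reading of the abstract (names and thresholds are illustrative):
      * at least `min_sup` supporting transactions,
      * each supporting transaction misses at most a fraction `eps_row`
        of the items,
      * each item is missing from at most a fraction `eps_col` of the
        supporting transactions.
    """
    items = list(itemset)
    rows = list(support_rows)
    if not items or len(rows) < min_sup:
        return False

    # Per-transaction error bound: fraction of the itemset that is absent.
    for r in rows:
        missing = sum(1 for i in items if i not in transactions[r])
        if missing > eps_row * len(items):
            return False

    # Per-item error bound: fraction of supporting transactions lacking it.
    for i in items:
        missing = sum(1 for r in rows if i not in transactions[r])
        if missing > eps_col * len(rows):
            return False

    return True


# A small noisy block: {a, b, c} is never fully present in rows 1 to 3.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"d"},
]

tolerant = is_afi(transactions, {"a", "b", "c"}, [0, 1, 2, 3], 0.5, 0.25, 4)
exact = is_afi(transactions, {"a", "b", "c"}, [0, 1, 2, 3], 0.0, 0.0, 4)
with_outlier = is_afi(transactions, {"a", "b", "c"}, [0, 1, 2, 3, 4], 0.5, 0.25, 4)
```

With both error fractions set to zero the check reduces to the exact model, in which every item must occur in every supporting transaction. The sketch only verifies a given candidate; it does not enumerate candidates or apply the noise-tolerant support pruning described in the abstract.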
