Optimization of parallel FP-Growth algorithm based on Spark

Xiang FANG, Gongxuan ZHANG
DOI: https://doi.org/10.16652/j.issn.1004-373x.2016.08.003
2016-01-01
Abstract: The data-compression advantage of the FP-Growth algorithm becomes more pronounced as the data size grows. With the MapReduce framework, the PFP-Growth algorithm can be parallelized on the Hadoop platform. However, when tasks are processed with MapReduce, intermediate results must be written to disk, which reduces the efficiency of the algorithm. Therefore, the algorithm is improved on the Spark platform using the concept of balanced grouping to raise the efficiency of association mining. In addition, if there is a long prefix, the improved algorithm splits the shared prefix. The improved algorithm, IPFP-Growth, is implemented in Spark in four steps. The experimental results show that the algorithm optimized in Spark outperforms the PFP-Growth algorithm.
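The abstract does not spell out how balanced grouping is computed, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a greedy assignment in which each frequent item goes to the group with the smallest estimated load so far. The load estimate (an item's rank in the frequency-ordered list) and the names frequent_items and balanced_groups are hypothetical, chosen for illustration; plain Python is used in place of Spark to keep the example self-contained.

```python
import heapq
from collections import Counter


def frequent_items(transactions, min_support):
    """Return items meeting min_support, most frequent first (ties by name)."""
    # Count each item at most once per transaction.
    counts = Counter(item for t in transactions for item in set(t))
    kept = [(item, c) for item, c in counts.items() if c >= min_support]
    return [item for item, _ in sorted(kept, key=lambda p: (-p[1], p[0]))]


def balanced_groups(f_list, num_groups):
    """Greedily assign items to groups so that estimated loads stay close."""
    # Assumed load proxy (not taken from the paper): rank + 1 in the
    # frequency-ordered list.
    loads = {item: rank + 1 for rank, item in enumerate(f_list)}

    # Min-heap of (current total load, group id).
    heap = [(0, gid) for gid in range(num_groups)]
    heapq.heapify(heap)
    groups = {gid: [] for gid in range(num_groups)}

    # Place the heaviest items first, each into the lightest group so far.
    for item in sorted(f_list, key=lambda i: -loads[i]):
        total, gid = heapq.heappop(heap)
        groups[gid].append(item)
        heapq.heappush(heap, (total + loads[item], gid))
    return groups, loads


transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["a"], ["b", "d"]]
f_list = frequent_items(transactions, min_support=2)
groups, loads = balanced_groups(f_list, num_groups=2)
```

In this toy run both groups end up with the same estimated load. In a Spark implementation, the resulting item-to-group map would typically be broadcast to the workers so that each transaction can be routed to the groups it concerns; that step is omitted here.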