Big Data Mining Platform Based on Cloud Computing

HE Qing, ZHUANG Fuzhen
DOI: https://doi.org/10.3969/j.issn.1009-6868.2013.04.006
2013-01-01
Abstract: In this paper, we develop PDMiner, a parallel and distributed data mining toolkit platform based on cloud computing. PDMiner supports data preprocessing, association rule analysis, and parallel classification and clustering. Our experimental results show that the parallel algorithms in PDMiner can handle data sets of up to one terabyte. The algorithms are efficient because they achieve good speedup, and they can be easily extended to run on a cluster of commodity machines, which makes full use of computing resources. The algorithms are also efficient for practical data mining. We also develop a knowledge flow subsystem that helps the user define a data mining task in PDMiner.