Research on Parallelized Stream Data Micro Clustering Algorithm

Ke Ma, Lingjuan Li, Yimu Ji, Shengmei Luo, Tao Wen
DOI: https://doi.org/10.2991/ameii-15.2015.116
2015-01-01
Abstract: Analysis and mining of stream data has been a hot research topic in recent years. To improve clustering efficiency, this paper proposes PSDMC, a Parallelized Stream Data Micro Clustering algorithm based on MapReduce, for the micro-clustering phase of the CluStream algorithm. PSDMC uses a series of containers to store real-time stream data according to arrival time. Each map node produces real-time local micro-clusters per unit time (for example, 1 second). The reduce node combines these real-time local micro-clusters into real-time global micro-clusters by using DBSCAN and the micro-clustering method of CluStream. The global micro-clusters are then used to update the local micro-clusters on every map node and to create snapshots that are stored in the Pyramidal Time Frame. Analysis shows that the efficiency of PSDMC increases nearly linearly with the number of map nodes while the clustering accuracy can be guaranteed.
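The map/reduce flow described in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation. It assumes the standard CluStream cluster feature vector (point count, linear and squared sums of the values and of the timestamps), whose additivity is what lets local micro-clusters be combined. The names `MicroCluster`, `absorb`, `map_step`, `reduce_step`, and the `radius` parameter are hypothetical, and the reduce step uses a simple greedy nearest-centroid merge as a stand-in for the DBSCAN-based combination the paper describes.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point = Tuple[Sequence[float], float]  # (feature values, arrival timestamp)


@dataclass
class MicroCluster:
    """CluStream-style cluster feature vector (assumed layout)."""
    n: int            # number of points absorbed
    ls: List[float]   # per-dimension linear sum of values
    ss: List[float]   # per-dimension sum of squared values
    lt: float         # linear sum of timestamps
    st: float         # sum of squared timestamps

    @classmethod
    def from_point(cls, x: Sequence[float], t: float) -> "MicroCluster":
        return cls(1, list(x), [v * v for v in x], t, t * t)

    def centroid(self) -> List[float]:
        return [v / self.n for v in self.ls]

    def merge(self, other: "MicroCluster") -> "MicroCluster":
        # Additivity: the feature vector of a union is the component-wise sum.
        return MicroCluster(
            self.n + other.n,
            [a + b for a, b in zip(self.ls, other.ls)],
            [a + b for a, b in zip(self.ss, other.ss)],
            self.lt + other.lt,
            self.st + other.st,
        )


def absorb(clusters: List[MicroCluster], mc: MicroCluster, radius: float) -> None:
    """Merge mc into the nearest cluster within radius, else start a new one."""
    best, best_d = None, radius
    for i, c in enumerate(clusters):
        d = math.dist(c.centroid(), mc.centroid())
        if d <= best_d:
            best, best_d = i, d
    if best is None:
        clusters.append(mc)
    else:
        clusters[best] = clusters[best].merge(mc)


def map_step(container: Sequence[Point], radius: float) -> List[MicroCluster]:
    """One map node: build local micro-clusters from one time-unit container."""
    local: List[MicroCluster] = []
    for x, t in container:
        absorb(local, MicroCluster.from_point(x, t), radius)
    return local


def reduce_step(local_lists: Sequence[List[MicroCluster]], radius: float) -> List[MicroCluster]:
    """Reduce node: combine local micro-clusters into global ones.

    Greedy centroid merge used here only as a simplified stand-in for DBSCAN.
    """
    global_mcs: List[MicroCluster] = []
    for local in local_lists:
        for mc in local:
            absorb(global_mcs, mc, radius)
    return global_mcs
```

In a real deployment each `map_step` call would run on a separate map node over its own container, and the output of `reduce_step` would be sent back to the map nodes and snapshotted. The sketch only shows why additive feature vectors make that split possible.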