Parallel Density-Based Clustering Algorithm by Using Weighted Grid and Information Entropy

Jian HU, Kaibin XU, Yimin MAO
DOI: https://doi.org/10.3778/j.issn.1673-9418.1912034
2020-01-01
Abstract: Density-based clustering algorithms for big data suffer from unreasonable grid partitioning of the data, low clustering accuracy, and inefficient parallelization. To address these problems, this paper proposes DBWGIE-MR, a density-based clustering algorithm that uses weighted grids and information entropy and is built on MapReduce. First, an adaptive division grid (ADG) strategy is proposed to partition the grid cells adaptively. Second, a weighted grid construction strategy called neighboring expand (NE), which strengthens the relevance between grids, is designed to improve clustering accuracy. A density calculation strategy based on weighted grid and information entropy (WGIE) is also designed to compute the density of each grid. In addition, the ε-neighborhood and core object of density-based clustering are redefined so that they apply to weighted grids. Then, the COMCORE-MR (core clusters computing algorithm based on MapReduce) algorithm is proposed to compute the local clusters in parallel. Finally, the MECORE-MR (merge core cluster by using MapReduce) algorithm, based on disjoint-set and MapReduce, is proposed to speed up the convergence of local cluster merging, which improves the merging efficiency of density-based clustering. Experimental results show that DBWGIE-MR produces better clustering results and parallelizes more effectively on large-scale datasets.
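The abstract does not spell out how MECORE-MR merges local clusters, so the following is only a minimal, single-machine sketch of the general disjoint-set (union-find) idea it names, not the paper's algorithm. It assumes that each local cluster is a set of grid-cell ids and that two local clusters belong together whenever they share a cell; the names `DisjointSet`, `merge_local_clusters`, and the cluster labels are hypothetical, and the MapReduce distribution step is omitted.

```python
from collections import defaultdict


class DisjointSet:
    """Union-find over arbitrary hashable keys, with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Register unseen keys as their own root.
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def merge_local_clusters(local_clusters):
    """Merge local clusters that share at least one grid cell.

    local_clusters: dict mapping a (hypothetical) local cluster id to a set
    of grid-cell ids. Returns the merged clusters as sorted lists of cells.
    """
    ds = DisjointSet()
    owner = {}  # cell id -> first local cluster seen that contains it
    for cluster_id, cells in local_clusters.items():
        ds.find(cluster_id)  # make sure isolated clusters are registered
        for cell in cells:
            if cell in owner:
                # Assumed merge rule: a shared cell links the two clusters.
                ds.union(owner[cell], cluster_id)
            else:
                owner[cell] = cluster_id

    merged = defaultdict(set)
    for cluster_id, cells in local_clusters.items():
        merged[ds.find(cluster_id)] |= cells
    return sorted(sorted(cells) for cells in merged.values())


if __name__ == "__main__":
    # Hypothetical local clusters produced by three partitions.
    local = {
        "p0-c0": {1, 2},
        "p1-c0": {2, 3},
        "p1-c1": {7, 8},
        "p2-c0": {3, 4},
    }
    print(merge_local_clusters(local))
```

With a disjoint-set, clusters that are only transitively connected (here `p0-c0` and `p2-c0`, linked through `p1-c0`) end up in one group after a single pass, without repeated pairwise rescans.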