Grid Partition and Agglomeration for Bidirectional Hierarchical Clustering
Lei Wu,Hechang Chen,Xiangchun Yu,Sun Chao,Zhezhou Yu,RuiTing Dou
DOI: https://doi.org/10.1007/978-3-030-21373-2_60
2019-01-01
Abstract:Clustering is an important data processing tool, which can be used to reveal the distribution structure of unfamiliar domain data, or as preprocess methods to magnify data object to accelerate subsequent processing or simplify models. However, the distribution of many real-world data in feature space is very complex or uneven. Besides, the similarity/distance is not easy to be properly defined in feature space with different dimensional quantity. Therefore, many existing clustering algorithms are not stable in real datasets, and better performance of different datasets relies on artificial special design, such as scale normalization. In this paper, we propose a bidirectional hierarchical clustering (BHC) algorithm with two phases. In the first phase (Top-down), based on the probability density function of data in different dimensions, the feature space is divided into over-segmented grids to adapt to the complex distribution of data. In the second phase (Bottom-up), based on statistical information, a robust distance instead of geometrical distance is defined to agglomerate the grids into a dendrogram. Compared with the individual data points, grids created in the first phase can carry more statistical information, and the magnified processing objects can accelerate the clustering process. The second phase enhances the algorithm’s ability by the ability of recognize arbitrary shape data clusters. The effectiveness of BHC is compared with 20 popular or recent clustering algorithms on 8 artificial datasets and 6 real-world datasets. And the results show that our algorithm can achieve good results on most datasets. In particular, BHC surpasses all the comparison algorithms involved in the experiment on all real-world datasets. In addition, in order to test the efficiency of the algorithm, we design an experiment which can test the influence of dimension and data size on the operation time.