An Efficient Parallel Nonlinear Clustering Algorithm Using MapReduce

Xiang-You Peng, Yu-Bo Yang, Chang-Dong Wang, Dong Huang, Jian-Huang Lai
DOI: https://doi.org/10.1109/ipdpsw.2016.7
2016-01-01
Abstract: As the amount of data grows rapidly, improving the scalability of nonlinear clustering has become a crucial and challenging problem. In this paper, we design an efficient parallel nonlinear clustering algorithm using a four-stage MapReduce framework. Our approach requires computing two quantities from distance matrices, which is difficult to do in a MapReduce framework. To address this issue, we propose to process the data in a streaming manner to compute the distances between points, while ensuring that the output of the original nonlinear clustering algorithm is unchanged. Our algorithm computes the distances between points in parallel and uses these distances to compute the density and the min-distances, from which we determine the cluster centers and thereby discover nonlinear clusters. Extensive experiments have been conducted to demonstrate the efficiency of the proposed approach.
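The abstract does not define the two distance-based quantities, so the following is only a minimal single-machine sketch under stated assumptions: "density" is taken to be the number of other points within a cutoff radius, and "min-distance" is taken to be the distance to the nearest point of higher density, with the densest point assigned its largest distance to any point. The function names, the `cutoff` parameter, and the density-times-distance ranking used to pick centers are all hypothetical choices for illustration and are not taken from the paper; the four-stage MapReduce pipeline itself is not reproduced here.

```python
import math


def pairwise_distances(points):
    """Full Euclidean distance matrix (the step the paper parallelizes)."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def density_and_min_distance(points, cutoff):
    """Return (density, min_dist) lists under the assumed definitions."""
    dist = pairwise_distances(points)
    n = len(points)
    # Assumed density: how many other points lie strictly within `cutoff`.
    density = [
        sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
        for i in range(n)
    ]
    min_dist = []
    for i in range(n):
        # Distances from point i to every point with strictly higher density.
        higher = [dist[i][j] for j in range(n) if density[j] > density[i]]
        # Assumed convention: a point with no denser neighbour gets its
        # maximum distance to any point, so it stands out as a center.
        min_dist.append(min(higher) if higher else max(dist[i]))
    return density, min_dist


def pick_centers(density, min_dist, k):
    """Rank points by density * min-distance and keep the top k indices."""
    score = [d * m for d, m in zip(density, min_dist)]
    return sorted(range(len(score)), key=lambda i: score[i], reverse=True)[:k]
```

In this sketch, points that are both dense and far from any denser point receive the highest scores, which is one plausible way the density and the min-distances could be combined to determine cluster centers. The quadratic distance matrix held in memory here is exactly the cost that a streaming, parallel computation would aim to avoid.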