Parallel Subspace Clustering Using MapReduce

Jia-ming DONG, Mao PAN, Chi ZHANG
DOI: https://doi.org/10.3969/j.issn.1671-1815.2017.15.015
2017-01-01
Abstract: As the volume of data being generated grows rapidly, subspace clustering of very large, moderate-to-high dimensional datasets has become increasingly important. Most subspace clustering methods, however, cannot solve this problem efficiently because they process data serially on a single machine. Sample-Ignore Subspace Clustering using MapReduce (SISCMR) is therefore proposed to address this problem. SISCMR is highly adaptable: it can use most serial clustering methods as a plug-in clustering subroutine. Extensive experiments on real and synthetic data with billions of points show good clustering quality, near-linear scalability, and high efficiency. Using 128 cores, it took only 10 minutes to cluster one of our largest experimental datasets, 0.2 TB in size, which demonstrates the feasibility of parallel clustering with MapReduce.