Design and Implementation of Parallel DBSCAN Algorithm Based on Spark

HUANG Ming-ji,ZHANG Qian
2017-01-01
Computer Science
Abstract:With the cloud application requiring less running time and higher performance as well as memory prices continuing to decline,the technology of memory-based distributed computing framework,such as Spark,has received unprecedented attention and is widely applied.We designed and implemented parallelized DBSCAN algorithm based on Spark,which can minimize the shuffle frequency and the data amount in shuffle.In order to optimize the algorithm parallelization strategy,we found the bottleneck of algorithm performance through the analysis and described the DAG of parallel DBSCAN algorithm based on Spark.Finally,we compared the performance of parallel DBSCAN algorithm with the DBSCAN algorithm,and analyzed the influence of different parameters on clustering results.Experimental results show that,compared with DBSCAN algorithm,the running time of parallel DBSCAN algorithm based on Spark decreases 37.2% and the speedup reaches 1.6 respectively on a data set containing three million lines without obvious loss of clustering accuracy.
What problem does this paper attempt to address?