A Parallel GPU-Based Approach to Clustering Very Fast Data Streams

Pengtao Huang, Xiu Li, Bo Yuan
DOI: https://doi.org/10.1145/2806416.2806545
2015-01-01
Abstract: Clustering data streams has become a hot topic in the era of big data. Driven by the ever-increasing volume, velocity, and variety of data, more efficient algorithms for clustering large-scale, complex data streams are needed. In this paper, we present a parallel algorithm called PaStream, which is based on advanced Graphics Processing Units (GPUs) and follows the online-offline framework of CluStream. Compared with CluStream, our approach can achieve speedups of hundreds of times on high-speed, high-dimensional data streams. It can also discover clusters of arbitrary shape and handle outliers properly. The efficiency and scalability of PaStream are demonstrated through comprehensive experiments on synthetic and standard benchmark datasets under various problem factors.