Large-Scale Automatic K-Means Clustering for Heterogeneous Many-Core Supercomputer

Teng Yu, Wenlai Zhao, Pan Liu, Vladimir Janjic, Xiaohan Yan, Shicai Wang, Haohuan Fu, Guangwen Yang, John Thomson
DOI: https://doi.org/10.1109/tpds.2019.2955467
IF: 5.3
2019-01-01
IEEE Transactions on Parallel and Distributed Systems
Abstract: This article presents an automatic $k$-means clustering solution targeting the Sunway TaihuLight supercomputer. We first introduce a multilevel parallel partition approach that partitions not only by dataflow and centroid but also by dimension, which unlocks the potential of the hierarchical parallelism in the heterogeneous many-core processor and the system architecture of the supercomputer. The parallel design is able to process large-scale clustering problems with up to 196,608 dimensions and over 160,000 target centroids, while maintaining high performance and high scalability. Furthermore, we propose an automatic hyper-parameter determination process for $k$-means clustering, which automatically generates and executes the clustering tasks with a set of candidate hyper-parameters and then determines the optimal hyper-parameter using a proposed evaluation method. The proposed auto-clustering solution can not only achieve high performance and scalability for problems with massive high-dimensional data, but also support clustering without sufficient prior knowledge of the number of targeted clusters, which can potentially extend the scope of the $k$-means algorithm to new application areas.
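The automatic hyper-parameter determination described in the abstract (run $k$-means for each candidate value, score each result, keep the best) can be illustrated with a minimal single-node sketch. Everything named here is an assumption made for illustration: the abstract does not specify the evaluation method, so the mean silhouette score is used as a stand-in, and the helper names `farthest_first_init`, `kmeans`, `silhouette`, and `auto_kmeans` are hypothetical. The sketch does not attempt the multilevel partitioning by dataflow, centroid, and dimension used on Sunway TaihuLight.

```python
import math


def farthest_first_init(points, k):
    # Deterministic seeding (an assumption, not the paper's method): start from
    # the first point, then repeatedly add the point farthest from all seeds.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iters=100):
    # Plain Lloyd iterations: assign each point to its nearest centroid,
    # then move every centroid to the mean of its members.
    centroids = farthest_first_init(points, k)
    labels = []
    for _ in range(max_iters):
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                updated.append(centroids[j])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


def silhouette(points, labels):
    # Mean silhouette score: a stand-in for the paper's evaluation method.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a score of 0
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for m, other in clusters.items()
            if m != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def auto_kmeans(points, candidate_ks):
    # Generate and run one clustering task per candidate k, score each result,
    # and return the best-scoring k together with all scores.
    scores = {}
    for k in candidate_ks:
        _, labels = kmeans(points, k)
        scores[k] = silhouette(points, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Toy data (hypothetical): three small groups of 2-D points.
offsets = [(-0.5, -0.5), (0.5, -0.5), (-0.5, 0.5), (0.5, 0.5), (0.0, 0.0)]
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]

best_k, scores = auto_kmeans(points, range(2, 6))
print(best_k, scores)
```

The candidate runs are independent of one another, which is what makes this kind of search a natural fit for executing many clustering tasks in parallel; the sketch runs them sequentially for simplicity.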