CPI-model-based analysis of sparse k-means clustering algorithms

Kazuo Aoyama,Kazumi Saito,Tetsuo Ikeda
DOI: https://doi.org/10.1007/s41060-021-00270-4
2021-06-15
International Journal of Data Science and Analytics
Abstract:Standard k-means clustering algorithms have been widely used to solve the partitioning problems of a given data set into k disjoint subsets. When a data set is large-scale and high-dimensional sparse, such as text data with a bag-of-words representation, it is not trivial which representations are adopted for both the data and mean sets. Additionally, algorithms that differ only in their representations need distinct elapsed times until their convergences, despite starting at an identical initial state and executing an identical number of similarity calculations, which is a conventional indicator of speed performance. We design sparse k-means clustering algorithms that utilize distinct representations, each of which is a pair of a data structure and an expression. Our purpose is to clarify the cause of their performance differences and identify the best algorithm when they are executed in a modern computer system. We analyze the algorithms with a simple yet practical clock-cycle per instruction (CPI) model that is expressed as a linear combination of four performance degradation factors in a modern computer system: the completed instructions, the level-1 and last-level cache misses, and the branch mispredictions. We also optimize the model parameters by a newly introduced procedure and demonstrate that CPIs calculated with our model agree well with experimental results when the algorithms are applied to large-scale and high-dimensional real document data sets. Furthermore, our model clarifies that the best algorithm among them suppresses the performance degradation factors of the number of cache misses, the branch mispredictions, and the completed instructions.
What problem does this paper attempt to address?