Abstract: We propose a new algorithm for k-means clustering in a distributed setting, where the data is distributed across many machines, and a coordinator communicates with these machines to calculate the output clustering. Our algorithm guarantees a cost approximation factor and a number of communication rounds that depend only on the computational capacity of the coordinator. Moreover, the algorithm includes a built-in stopping mechanism, which allows it to use fewer communication rounds whenever possible. We show both theoretically and empirically that in many natural cases, indeed 1-4 rounds suffice. In comparison with the popular k-means|| algorithm, our approach allows exploiting a larger coordinator capacity to obtain a smaller number of rounds. Our experiments show that the k-means cost obtained by the proposed algorithm is usually better than the cost obtained by k-means||, even when the latter is allowed a larger number of rounds. Moreover, the machine running time in our approach is considerably smaller than that of k-means||. Code for running the algorithm and experiments is available at https://github.com/selotape/distributed_k_means.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is how to reduce the number of communication rounds while preserving clustering quality when performing k-means clustering in a distributed computing environment. Specifically:

1. **Reducing the number of communication rounds**: In distributed k-means clustering, data is distributed across multiple machines, and the coordinator needs to communicate with these machines over multiple rounds to complete the clustering task. Each round of communication incurs synchronization and communication overhead, so reducing the number of rounds is key to improving the algorithm's efficiency.
2. **Ensuring the clustering quality**: Although reducing the number of communication rounds can improve efficiency, the quality of the clustering results must not decline significantly. The goal of the paper is to design an algorithm that achieves high-quality clustering within a relatively small number of communication rounds.

### The solution proposed in the paper

The paper proposes a new distributed k-means clustering algorithm named SOCCER (Sampling, Optimal Clustering Cost Estimation, Removal), with the following main features:

- **Adaptive stopping mechanism**: SOCCER can automatically decide when to stop according to the characteristics of the data set, thus avoiding unnecessary communication rounds. This enables SOCCER to complete the clustering task within 1 to 4 rounds of communication on many natural data sets.
- **Dependence on the coordinator's computing power**: SOCCER uses the stronger computing power of the coordinator to perform partially centralized clustering calculations, and estimates the optimal k-means cost based on the results of these calculations, thereby guiding the point removal operations on each machine.
- **Theoretical guarantee**: SOCCER provides a theoretical approximation factor guarantee, and this approximation factor depends only on the coordinator's computing power, without the need to assume that the optimal clustering cost is far from zero.

### Comparison with existing methods

- **Compared with k-means||**:
  - k-means|| is a popular distributed k-means algorithm, but it does not have an adaptive stopping mechanism, and the number of communication rounds usually has to be set as a hyperparameter. In addition, k-means|| may require more communication rounds to obtain a better clustering in some cases.
  - SOCCER performs better in the experiments. It usually needs fewer communication rounds to obtain a better clustering cost than k-means||, and the machine running time is also shorter.
- **Compared with EIM11 (Ene et al., 2011)**:
  - Although EIM11 also has a theoretical guarantee, it has significant disadvantages in practice: it always uses the worst-case number of communication rounds and transmits a large amount of data each time, resulting in an excessive computational burden on the machine side.
  - SOCCER reduces the amount of transmitted data and the computational burden on the machine side through a more efficient point removal strategy, making it more competitive in practical applications.

### Experimental verification

The paper verifies the effectiveness of SOCCER through experiments on synthetic and real data sets. The experimental results show that in many cases SOCCER indeed needs only a small number of communication rounds to complete high-quality clustering tasks, and that it is also superior to other methods in terms of machine running time.

### Summary

The main contribution of this paper is a new distributed k-means clustering algorithm, SOCCER, which not only provides a good approximation factor guarantee theoretically, but also shows higher efficiency and better clustering results in practical applications.
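
To make the sample, estimate, and remove loop described above more concrete, here is a minimal single-process sketch in Python. It is not the authors' implementation (their code is at the GitHub URL in the abstract) and it does not reproduce the paper's guarantees. Everything in it is an assumption made for illustration: the function names, the use of plain Lloyd's iterations as the coordinator's solver, the `capacity` parameter standing in for the coordinator's computational capacity, and in particular the removal threshold, which is simply the median sample distance rather than the paper's cost estimate. The final reduction of the collected centers down to exactly k is omitted.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_sq_dist(p, centers):
    """Squared distance from p to its closest center."""
    return min(sq_dist(p, c) for c in centers)


def lloyd(points, k, iters=10, rng=None):
    """Plain Lloyd's k-means; a stand-in for whatever solver the coordinator runs."""
    rng = rng or random.Random(0)
    centers = rng.sample(points, min(k, len(points)))
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            groups[j].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(x) / len(g) for x in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers


def sample_estimate_remove(machines, k, capacity, max_rounds=10, seed=0):
    """Simulate the rounds; returns (candidate centers, rounds used)."""
    rng = random.Random(seed)
    machines = [list(m) for m in machines]
    centers, rounds = [], 0
    while rounds < max_rounds:
        remaining = sum(len(m) for m in machines)
        if remaining <= capacity:
            break  # stopping rule: the leftovers fit at the coordinator
        rounds += 1
        # Sampling: each machine sends a share roughly proportional to its size.
        sample = []
        for m in machines:
            if m:
                share = max(1, capacity * len(m) // remaining)
                sample.extend(rng.sample(m, min(share, len(m))))
        # Estimation: cluster the sample, then derive a removal threshold.
        # ASSUMPTION: the median sample distance is a placeholder rule.
        new_centers = lloyd(sample, k, rng=rng)
        dists = sorted(nearest_sq_dist(p, new_centers) for p in sample)
        threshold = dists[len(dists) // 2]
        centers.extend(new_centers)
        # Removal: every machine drops the points already well served.
        machines = [
            [p for p in m if nearest_sq_dist(p, new_centers) > threshold]
            for m in machines
        ]
    leftovers = [p for m in machines for p in m]
    if leftovers:
        centers.extend(lloyd(leftovers, k, rng=rng))
    return centers, rounds
```

Calling `sample_estimate_remove(machines, k=3, capacity=200)` on a list of per-machine point lists returns the candidate centers and the number of rounds that were actually needed. In this sketch the loop ends as soon as the surviving points fit within `capacity`, which is the sense in which a larger coordinator capacity can translate into fewer rounds.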

Fast Distributed k-Means with a Small Number of Rounds

K-Means Clustering with Distributed Dimensions.

Distributed Privacy-Aware Fast Selection Algorithm for Large-Scale Data.

Fast Algorithms for Distributed K-Clustering with Outliers.

Clustering with Distributed Data

Distributed Information Theoretic Clustering

$k$-Center Clustering in Distributed Models

A Scalable Algorithm for Individually Fair K-means Clustering

Distributed k-Means with Outliers in General Metrics

Fast K-Means Based on KNN Graph

Scalable K-Means for Large-Scale Clustering.

Distributed Kernel K-Means for Large Scale Clustering

Distributed Clustering based on Distributional Kernel

Fast k-means algorithm clustering

Sparse Embedded K-Means Clustering.

Distributed Fair k-Center Clustering Problems with Outliers

Subspace Clustering by Directly Solving Discriminative K-means

Distributed Consensus-Based K-Means Algorithm in Switching Multi-Agent Networks

New Algorithms for Distributed Fair K-Center Clustering: Almost Accurate As Sequential Algorithms.

F3KM: Federated, Fair, and Fast K-Means

A Novel Density Based Clustering Algorithm and Its Parallelization.