Explainable $k$-Means and $k$-Medians Clustering

Sanjoy Dasgupta, Nave Frost, Michal Moshkovitz, Cyrus Rashtchian

DOI: https://doi.org/10.48550/arXiv.2002.12538

2020-09-22

Abstract: Clustering is a popular form of unsupervised learning for geometric data. Unfortunately, many clustering algorithms lead to cluster assignments that are hard to explain, partially because they depend on all the features of the data in a complicated way. To improve interpretability, we consider using a small decision tree to partition a data set into clusters, so that clusters can be characterized in a straightforward manner. We study this problem from a theoretical viewpoint, measuring cluster quality by the $k$-means and $k$-medians objectives: Must there exist a tree-induced clustering whose cost is comparable to that of the best unconstrained clustering, and if so, how can it be found? In terms of negative results, we show, first, that popular top-down decision tree algorithms may lead to clusterings with arbitrarily large cost, and second, that any tree-induced clustering must in general incur an $\Omega(\log k)$ approximation factor compared to the optimal clustering. On the positive side, we design an efficient algorithm that produces explainable clusters using a tree with $k$ leaves. For two means/medians, we show that a single threshold cut suffices to achieve a constant factor approximation, and we give nearly-matching lower bounds. For general $k \geq 2$, our algorithm is an $O(k)$ approximation to the optimal $k$-medians and an $O(k^2)$ approximation to the optimal $k$-means. Prior to our work, no algorithms were known with provable guarantees independent of dimension and input size.
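For reference, the two objectives named in the abstract are commonly written as follows. This is a sketch under assumptions the abstract does not spell out: the symbols $C_1, \dots, C_k$ (clusters), $\mu_j$ (cluster centers), and the choice of the $\ell_1$ norm for $k$-medians and the squared $\ell_2$ norm for $k$-means are introduced here for illustration.

$$
\mathrm{cost}_{\text{medians}}(C_1, \dots, C_k) = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert_1,
\qquad
\mathrm{cost}_{\text{means}}(C_1, \dots, C_k) = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert_2^2 .
$$

Under this notation, an algorithm is an $\alpha$-approximation if the tree-induced clustering it returns has cost at most $\alpha$ times the cost of the best unconstrained clustering.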

Machine Learning, Computational Geometry, Data Structures and Algorithms

What problem does this paper attempt to address?

This paper aims to make clustering more interpretable. Traditional algorithms for objectives such as $k$-means and $k$-medians can group data effectively, but the resulting clusters are often hard to explain because the assignments depend on all features of the data in a complicated way. This makes it difficult for a user to see what points in the same cluster have in common, or what separates points in different clusters. To address this, the paper partitions the data set with a small decision tree, so that each cluster is described by a short sequence of threshold tests on individual features. The authors study this approach from a theoretical perspective, asking whether a tree-induced clustering can have cost comparable to that of the optimal unconstrained clustering and, if so, how to find one. For two clusters ($k = 2$), they show that a single threshold cut suffices to achieve a constant-factor approximation. For general $k \geq 2$, their algorithm uses a tree with $k$ leaves and achieves an $O(k)$ approximation to the optimal $k$-medians cost and an $O(k^2)$ approximation to the optimal $k$-means cost. These guarantees are independent of the data dimension and the input size.
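To make the $k = 2$ case concrete, here is a minimal Python sketch of what a single threshold cut looks like. It is an illustration only, not the paper's algorithm: it brute-forces every axis-aligned cut and keeps the one with the lowest 2-means cost. The function names `two_means_cost` and `best_threshold_cut`, and the choice of midpoints as candidate thresholds, are assumptions made for this example.

```python
import numpy as np

def two_means_cost(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    cost = 0.0
    for c in np.unique(labels):
        cluster = points[labels == c]
        cost += ((cluster - cluster.mean(axis=0)) ** 2).sum()
    return cost

def best_threshold_cut(points):
    """Brute-force search (hypothetical helper) over cuts of the form
    'feature <= theta' vs 'feature > theta'; returns (cost, feature, theta)."""
    n, d = points.shape
    best = (np.inf, None, None)
    for feature in range(d):
        values = np.unique(points[:, feature])  # sorted distinct values
        # Candidate thresholds: midpoints between consecutive distinct values,
        # so both sides of every cut are non-empty.
        for theta in (values[:-1] + values[1:]) / 2:
            labels = (points[:, feature] > theta).astype(int)
            cost = two_means_cost(points, labels)
            if cost < best[0]:
                best = (cost, feature, theta)
    return best

# Toy data: two groups separated along feature 0.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
cost, feature, theta = best_threshold_cut(points)
print(cost, feature, theta)
```

On this toy data the chosen cut is on feature 0, so each cluster is fully described by one rule ("feature 0 at most 5" versus "feature 0 greater than 5"), which is the kind of explanation a two-leaf tree provides.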

The Price of Explainability for Clustering

Almost-linear Time Approximation Algorithm to Euclidean $k$-median and $k$-means

Explaining Kernel Clustering via Decision Trees

Subspace Clustering by Directly Solving Discriminative K-means

Cluster-level Group Representativity Fairness in $k$-means Clustering

Optimal Time Bounds for Approximate Clustering

Fully Dynamic $k$-Median with Near-Optimal Update Time and Recourse

Randomized Dimensionality Reduction for k-means Clustering

A Scalable Algorithm for Individually Fair K-means Clustering

Replicable Clustering

Multi-Prototypes Convex Merging Based K-Means Clustering Algorithm

Fully Dynamic $k$-Clustering with Fast Update Time and Small Recourse

K – Means Algorithm

Streaming Euclidean $k$-median and $k$-means with $o(\log n)$ Space

MeanCut: A Greedy-Optimized Graph Clustering via Path-based Similarity and Degree Descent Criterion

Faster K-Means Cluster Estimation

Simple, Scalable and Effective Clustering via One-Dimensional Projections

Fully Dynamic k-Means Coreset in Near-Optimal Update Time

Clustering Stable Instances of Euclidean k-means

Hybrid k-Clustering: Blending k-Median and k-Center