Explainable $k$-Means and $k$-Medians Clustering

Sanjoy Dasgupta, Nave Frost, Michal Moshkovitz, Cyrus Rashtchian
DOI: https://doi.org/10.48550/arXiv.2002.12538
2020-09-22
Abstract: Clustering is a popular form of unsupervised learning for geometric data. Unfortunately, many clustering algorithms lead to cluster assignments that are hard to explain, partially because they depend on all the features of the data in a complicated way. To improve interpretability, we consider using a small decision tree to partition a data set into clusters, so that clusters can be characterized in a straightforward manner. We study this problem from a theoretical viewpoint, measuring cluster quality by the $k$-means and $k$-medians objectives: Must there exist a tree-induced clustering whose cost is comparable to that of the best unconstrained clustering, and if so, how can it be found? In terms of negative results, we show, first, that popular top-down decision tree algorithms may lead to clusterings with arbitrarily large cost, and second, that any tree-induced clustering must in general incur an $\Omega(\log k)$ approximation factor compared to the optimal clustering. On the positive side, we design an efficient algorithm that produces explainable clusters using a tree with $k$ leaves. For two means/medians, we show that a single threshold cut suffices to achieve a constant factor approximation, and we give nearly-matching lower bounds. For general $k \geq 2$, our algorithm is an $O(k)$ approximation to the optimal $k$-medians and an $O(k^2)$ approximation to the optimal $k$-means. Prior to our work, no algorithms were known with provable guarantees independent of dimension and input size.
Machine Learning, Computational Geometry, Data Structures and Algorithms
What problem does this paper attempt to address?
This paper addresses the interpretability of clustering. Standard clustering algorithms, such as those for $k$-means and $k$-medians, can group data effectively, but the resulting cluster assignments are often hard to explain because they depend on all features of the data in a complicated way. This makes it difficult for a user to see what the points in one cluster have in common or what distinguishes points in different clusters. To address this, the paper proposes partitioning the data set with a small decision tree, so that each cluster can be described by a short sequence of threshold tests on individual features. The authors study this approach from a theoretical perspective, asking whether a tree-induced clustering can have cost comparable to that of the best unconstrained clustering and, if so, how such a clustering can be found. On the positive side, they show that for two clusters ($k = 2$) a single threshold cut suffices to achieve a constant-factor approximation, and for general $k \geq 2$ they give an efficient algorithm that achieves an $O(k)$ approximation to the optimal $k$-medians cost and an $O(k^2)$ approximation to the optimal $k$-means cost. These guarantees are independent of the dimension and the input size.
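To make the notion of a single threshold cut concrete, the following is a minimal Python sketch for the two-cluster case. It is not the paper's algorithm: it simply tries every feature and every midpoint between consecutive distinct values, and keeps the split with the lowest 2-means cost, taken here to be the sum of squared Euclidean distances from each point to the mean of its side. The function names `kmeans_cost_one_cluster` and `best_threshold_cut` and the toy data are assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Point = Sequence[float]


def kmeans_cost_one_cluster(points: List[Point]) -> float:
    """Sum of squared Euclidean distances from each point to the cluster mean."""
    if not points:
        return 0.0
    dim = len(points[0])
    mean = [sum(p[j] for p in points) / len(points) for j in range(dim)]
    return sum((p[j] - mean[j]) ** 2 for p in points for j in range(dim))


def best_threshold_cut(points: List[Point]) -> Tuple[int, float, float]:
    """Brute-force search over (feature, threshold) pairs (illustrative only).

    Returns (feature index, threshold, 2-means cost of the induced split).
    """
    best: Optional[Tuple[int, float, float]] = None
    dim = len(points[0])
    for j in range(dim):
        values = sorted({p[j] for p in points})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            theta = (lo + hi) / 2
            left = [p for p in points if p[j] <= theta]
            right = [p for p in points if p[j] > theta]
            cost = kmeans_cost_one_cluster(left) + kmeans_cost_one_cluster(right)
            if best is None or cost < best[2]:
                best = (j, theta, cost)
    if best is None:
        raise ValueError("need at least two distinct values in some feature")
    return best


# Toy data (hypothetical): two well-separated groups along feature 0.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 0.0), (10.0, 1.0), (11.0, 0.0)]
feature, theta, cost = best_threshold_cut(points)
print(f"split on feature {feature} at {theta}: cost {cost:.3f}")
```

On this toy input the search selects feature 0 with threshold 5.5, so the resulting clustering is explained by one rule: a point belongs to the first cluster exactly when its first coordinate is at most 5.5. The sketch recomputes the cost from scratch for every candidate, which is quadratic in the number of points per feature; it is meant to show what is being optimized, not how to optimize it efficiently.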