On the power of linear programming for K-means clustering

Antonio De Rosa, Aida Khajavirad, Yakun Wang
2024-08-16
Abstract: In [SIAM J. Optim., 2022], the authors introduced a new linear programming (LP) relaxation for K-means clustering. In this paper, we further investigate both theoretical and computational properties of this relaxation. As evident from our numerical experiments with both synthetic and real-world data sets, the proposed LP relaxation is almost always tight; i.e., its optimal solution is feasible for the original nonconvex problem. To better understand this unexpected behaviour, on the theoretical side, we focus on K-means clustering with two clusters, and we obtain sufficient conditions under which the LP relaxation is tight. We further analyze the sufficient conditions when the input is generated according to a popular stochastic model and obtain recovery guarantees for the LP relaxation. We conclude our theoretical study by constructing a family of inputs for which the LP relaxation is never tight. Denoting by $n$ the number of data points to be clustered, the LP relaxation contains $\Omega(n^3)$ inequalities, making it impractical for large data sets. To address the scalability issue, by building upon a cutting-plane algorithm together with the GPU implementation of PDLP, a first-order method LP solver, we develop an efficient algorithm that solves the proposed LP, and hence the K-means clustering problem, for $n \leq 4000$ data points.
Optimization and Control
What problem does this paper attempt to address?
The paper asks how effective a linear programming (LP) relaxation is for K-means clustering, and when it can be solved in practice. Specifically, the authors further investigate the theoretical and computational properties of the LP relaxation they introduced in [SIAM J. Optim., 2022]. The main objectives of the paper are:

1. **Theoretical Analysis**:
   - **Tightness Conditions**: Determine when the LP relaxation is tight, i.e. when its optimal solution is feasible for the original nonconvex problem. For the case of two clusters, the authors give sufficient conditions under which the relaxation is tight.
   - **Recovery Guarantee**: Analyze these sufficient conditions when the input is generated according to a popular stochastic model, and obtain recovery guarantees for the LP relaxation.
   - **Limits of Tightness**: Construct a family of inputs for which the LP relaxation is never tight.
2. **Computational Performance**:
   - **Scalability**: For $n$ data points the LP relaxation contains $\Omega(n^3)$ inequalities, which makes it impractical for large data sets. To address this, the authors combine a cutting-plane algorithm with the GPU implementation of PDLP, a first-order LP solver, and solve instances with up to 4000 data points.

Through these studies, the authors aim to explain why, in their numerical experiments on synthetic and real-world data sets, the LP relaxation is almost always tight. The results both clarify the theory and make the relaxation usable as a practical method for solving K-means clustering.
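
To make "tight" and "cutting plane" concrete, the following is a minimal Python sketch, not the authors' implementation. It relies on one standard fact: the K-means cost of a partition equals $\frac{1}{2}\sum_{i,j} d_{ij} X_{ij}$, where $d_{ij}$ is the squared distance between points $i$ and $j$, and $X_{ij} = 1/|C|$ if $i$ and $j$ lie in the same cluster $C$ and $0$ otherwise. A relaxation over such matrices is tight when its optimum is itself a matrix of this form. The inequality family used in `violated_cuts`, $X_{ij} + X_{ik} \leq X_{ii} + X_{jk}$, is an assumption for illustration: it holds for every partition matrix and has $\Omega(n^3)$ members, but the text above does not state the exact inequalities of the paper's LP. All function names are hypothetical.

```python
from itertools import combinations

def partition_matrix(labels):
    """X[i][j] = 1/|C| if i and j share cluster C, else 0."""
    sizes = {}
    for c in labels:
        sizes[c] = sizes.get(c, 0) + 1
    n = len(labels)
    return [[1.0 / sizes[labels[i]] if labels[i] == labels[j] else 0.0
             for j in range(n)] for i in range(n)]

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_cost(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members)
                    for d in range(dim)]
        total += sum(sq_dist(p, centroid) for p in members)
    return total

def linear_cost(points, X):
    """The same cost written as a linear function of X."""
    n = len(points)
    return 0.5 * sum(sq_dist(points[i], points[j]) * X[i][j]
                     for i in range(n) for j in range(n))

def violated_cuts(X, tol=1e-9, max_cuts=10):
    """Separation step of a cutting-plane loop (brute force, O(n^3)).

    Returns up to max_cuts tuples (violation, i, j, k), most violated first,
    for the ASSUMED inequalities X[i][j] + X[i][k] <= X[i][i] + X[j][k].
    """
    n = len(X)
    cuts = []
    for i in range(n):
        others = [t for t in range(n) if t != i]
        for j, k in combinations(others, 2):
            violation = X[i][j] + X[i][k] - X[i][i] - X[j][k]
            if violation > tol:
                cuts.append((violation, i, j, k))
    cuts.sort(reverse=True)
    return cuts[:max_cuts]

# Two well-separated pairs of points in the plane.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
X = partition_matrix(labels)
```

In a cutting-plane scheme of the kind the abstract describes, one would solve the LP with only a small subset of the inequalities, call a routine like `violated_cuts` on the solution, add the returned inequalities, and re-solve until none are violated. The LP solve itself (PDLP on a GPU in the paper) is deliberately left out of this sketch.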