On the power of linear programming for K-means clustering

Antonio De Rosa, Aida Khajavirad, Yakun Wang

2024-08-16

Abstract: In [SIAM J. Optim., 2022], the authors introduced a new linear programming (LP) relaxation for K-means clustering. In this paper, we further investigate both theoretical and computational properties of this relaxation. As evident from our numerical experiments with both synthetic and real-world data sets, the proposed LP relaxation is almost always tight; i.e., its optimal solution is feasible for the original nonconvex problem. To better understand this unexpected behaviour, on the theoretical side, we focus on K-means clustering with two clusters, and we obtain sufficient conditions under which the LP relaxation is tight. We further analyze the sufficient conditions when the input is generated according to a popular stochastic model and obtain recovery guarantees for the LP relaxation. We conclude our theoretical study by constructing a family of inputs for which the LP relaxation is never tight. Denoting by $n$ the number of data points to be clustered, the LP relaxation contains $\Omega(n^3)$ inequalities, making it impractical for large data sets. To address the scalability issue, by building upon a cutting-plane algorithm together with the GPU implementation of PDLP, a first-order method LP solver, we develop an efficient algorithm that solves the proposed LP, and hence the K-means clustering problem, for instances with $n \leq 4000$ data points.
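To make the notion of tightness concrete, the following is a minimal sketch, not the authors' code. The abstract does not spell out the relaxation, so the sketch assumes a common partition-matrix formulation: a symmetric nonnegative matrix $X$ with unit row sums, trace $K$, and triangle-type inequalities $X_{ij} + X_{il} \leq X_{ii} + X_{jl}$, which give on the order of $n^3$ constraints. The function names `kmeans_lp` and `kmeans_cost`, the toy data, and the use of SciPy's HiGHS backend are all illustrative choices.

```python
import itertools
import numpy as np
from scipy.optimize import linprog


def kmeans_lp(points, k):
    """Solve an ASSUMED LP relaxation of K-means over a symmetric matrix X."""
    n = len(points)
    iu = np.triu_indices(n)                 # one LP variable per pair i <= j
    pos = np.zeros((n, n), dtype=int)
    pos[iu] = np.arange(len(iu[0]))
    pos = np.maximum(pos, pos.T)            # symmetric lookup: pos[i, j] == pos[j, i]
    nv = len(iu[0])

    # Objective (1/2) * sum_{i,j} d_ij X_ij equals sum_{i<j} d_ij X_ij (d_ii = 0).
    d = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    c = d[iu]

    # Equalities: every row of X sums to 1, and the trace of X equals k.
    A_eq = np.zeros((n + 1, nv))
    for i in range(n):
        A_eq[i, pos[i]] = 1.0
        A_eq[n, pos[i, i]] = 1.0
    b_eq = np.append(np.ones(n), k)

    # Assumed triangle-type inequalities: X_ij + X_il <= X_ii + X_jl.
    rows = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for j, l in itertools.combinations(others, 2):
            row = np.zeros(nv)
            row[pos[i, j]] += 1.0
            row[pos[i, l]] += 1.0
            row[pos[i, i]] -= 1.0
            row[pos[j, l]] -= 1.0
            rows.append(row)

    res = linprog(c, A_ub=np.array(rows), b_ub=np.zeros(len(rows)),
                  A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun, res.x[pos]              # LP value and the n-by-n matrix X


def kmeans_cost(points, labels):
    """Sum of squared distances of each point to its cluster centroid."""
    return sum(((points[labels == a] - points[labels == a].mean(axis=0)) ** 2).sum()
               for a in np.unique(labels))


# Toy data (hypothetical): two small, well-separated groups of four points.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 0.1, (4, 2)), rng.normal(5.0, 0.1, (4, 2))])
labels = np.array([0] * 4 + [1] * 4)

lp_value, X = kmeans_lp(points, k=2)
planted_value = kmeans_cost(points, labels)
print(lp_value, planted_value)
```

Because the partition matrix of any clustering is feasible for this LP, `lp_value` can never exceed `planted_value`. When the two printed values coincide and `X` is block diagonal with entries $1/4$ inside each group, the relaxation is tight on this instance in the sense described in the abstract.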

Optimization and Control

What problem does this paper attempt to address?

The paper examines how effective and how practical a linear programming (LP) relaxation is for K-means clustering. Building on an LP relaxation introduced in earlier work [SIAM J. Optim., 2022], the authors further investigate its theoretical and computational properties. The main objectives of the paper are:

1. **Theoretical Analysis**:
   - **Tightness Conditions**: Determine when the LP relaxation is tight, that is, when its optimal solution is feasible for the original nonconvex problem. For the case of two clusters, the authors provide sufficient conditions that guarantee tightness.
   - **Recovery Guarantee**: Analyze these sufficient conditions when the input is generated according to a popular stochastic model, and derive recovery guarantees for the LP relaxation.
   - **Limits of Tightness**: Construct a family of inputs for which the LP relaxation is never tight.
2. **Computational Performance**:
   - **Scalability**: With $n$ data points, the LP relaxation contains $\Omega(n^3)$ inequalities, which makes it impractical for large data sets. To address this, the authors combine a cutting-plane algorithm with the GPU implementation of PDLP, a first-order LP solver, and solve instances with up to 4000 data points.

Together, these results help explain why, in the authors' numerical experiments on synthetic and real-world data sets, the LP relaxation is almost always tight and therefore solves the K-means clustering problem.
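The cutting-plane idea can be illustrated with a generic lazy-constraint loop. This is a sketch under assumptions, not the authors' algorithm: it reuses an assumed partition-matrix formulation with triangle-type inequalities $X_{ij} + X_{il} \leq X_{ii} + X_{jl}$, it separates violated inequalities by brute-force enumeration, and SciPy's HiGHS solver stands in for GPU-based PDLP. The function name `kmeans_lp_with_cuts` and the toy data are hypothetical.

```python
import itertools
import numpy as np
from scipy.optimize import linprog


def kmeans_lp_with_cuts(points, k, max_rounds=50, tol=1e-6):
    """Solve the assumed LP by adding violated triangle inequalities lazily."""
    n = len(points)
    iu = np.triu_indices(n)                 # one LP variable per pair i <= j
    pos = np.zeros((n, n), dtype=int)
    pos[iu] = np.arange(len(iu[0]))
    pos = np.maximum(pos, pos.T)            # symmetric lookup
    nv = len(iu[0])

    d = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    c = d[iu]                               # objective: sum_{i<j} d_ij X_ij

    A_eq = np.zeros((n + 1, nv))            # row sums equal 1, trace equals k
    for i in range(n):
        A_eq[i, pos[i]] = 1.0
        A_eq[n, pos[i, i]] = 1.0
    b_eq = np.append(np.ones(n), k)

    # All candidate inequalities; only violated ones ever enter the LP.
    triples = [(i, j, l) for i in range(n)
               for j, l in itertools.combinations(
                   [m for m in range(n) if m != i], 2)]

    cuts, converged = [], False
    for _ in range(max_rounds):
        res = linprog(c,
                      A_ub=np.array(cuts) if cuts else None,
                      b_ub=np.zeros(len(cuts)) if cuts else None,
                      A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
        X = res.x[pos]
        # Separation step: list every inequality the current X violates.
        violated = [(i, j, l) for i, j, l in triples
                    if X[i, j] + X[i, l] - X[i, i] - X[j, l] > tol]
        if not violated:
            converged = True                # X satisfies all inequalities
            break
        for i, j, l in violated:
            row = np.zeros(nv)
            row[pos[i, j]] += 1.0
            row[pos[i, l]] += 1.0
            row[pos[i, i]] -= 1.0
            row[pos[j, l]] -= 1.0
            cuts.append(row)
    return res.fun, X, len(cuts), converged


# Toy data (hypothetical): two small, well-separated groups of four points.
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0.0, 0.1, (4, 2)), rng.normal(5.0, 0.1, (4, 2))])
planted_value = sum(((g - g.mean(axis=0)) ** 2).sum()
                    for g in (points[:4], points[4:]))

value, X, n_cuts, converged = kmeans_lp_with_cuts(points, k=2)
print(value, planted_value, n_cuts, converged)
```

The point of the design is that each intermediate LP is a relaxation of the full one, so `value` is always a valid lower bound, and if the loop stops with `converged` set, `X` is optimal for the full LP even though only `n_cuts` of the cubic number of inequalities were ever added. A scalable implementation would replace the brute-force separation and the dense constraint matrix.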

Relax, no need to round: integrality of clustering formulations

A dependent LP-rounding approach for the k-median problem

Relax and Merge: A Simple Yet Effective Framework for Solving Fair $k$-Means and $k$-sparse Wasserstein Barycenter Problems

Probabilistic K-means Clustering via Nonlinear Programming

A cutting plane algorithm for globally solving low dimensional k-means clustering problems

When Do Birds of a Feather Flock Together? K-Means, Proximity, and Conic Programming.

Speeding up Linear Programming using Randomized Linear Algebra

Improved Conic Reformulations for K-means Clustering

Statistically Optimal K-means Clustering via Nonnegative Low-rank Semidefinite Programming

When do birds of a feather flock together?

Understanding the Cluster Linear Program for Correlation Clustering

Data-driven optimal control via linear programming: boundedness guarantees

Global Optimization for Cardinality-constrained Minimum Sum-of-Squares Clustering via Semidefinite Programming

A New branch-and-cut algorithm for linear sum-of-ratios problem based on SLO method and LO relaxation

Probably certifiably correct k-means clustering

Linear Programming Relaxations of Quadratically Constrained Quadratic Programs

Linearization of McCormick relaxations and hybridization with the auxiliary variable method

Near-Optimal Algorithms for Constrained k-Center Clustering with Instance-level Background Knowledge

Constant Approximation for K-Median and K-Means with Outliers Via Iterative Rounding

Exact Algorithms and Lower Bounds for Stable Instances of Euclidean k-Means