Abstract: Spectral clustering is one of the most popular clustering algorithms and has stood the test of time. It is simple to describe, can be implemented using standard linear algebra, and often finds better clusters than traditional clustering algorithms like $k$-means and $k$-centers. The foundational algorithm for two-way spectral clustering, by Shi and Malik, creates a geometric graph from data and finds a spectral cut of the graph. In modern machine learning, many data sets are modeled as a large number of points drawn from a probability density function. Little is known about when spectral clustering works in this setting, and when it does not. Past researchers justified spectral clustering by appealing to the graph Cheeger inequality (which states that the spectral cut of a graph approximates the "Normalized Cut"), but this justification is known to break down on large data sets. We provide theoretically informed intuition about spectral clustering on large data sets drawn from probability densities, by proving when a continuous form of spectral clustering considered by past researchers (the unweighted spectral cut of a probability density) finds good clusters of the underlying density itself. Our work suggests that Shi-Malik spectral clustering works well on data drawn from mixtures of Laplace distributions, and works poorly on data drawn from certain other densities, such as a density we call the "square-root trough". Our core theorem proves that weighted spectral cuts have low weighted isoperimetry for all probability densities. Our key tool is a new Cheeger-Buser inequality for all probability densities, including discontinuous ones.
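
The two-way procedure the abstract describes (build a geometric graph from the data, then take a spectral cut) can be illustrated with a minimal sketch. This is not the paper's construction. The Gaussian edge weights, the bandwidth `sigma`, the use of a complete graph, and the threshold sweep that minimizes the Normalized Cut are all assumptions chosen for illustration, and the function name `two_way_spectral_cut` is hypothetical.

```python
import numpy as np


def two_way_spectral_cut(points, sigma=1.0):
    """Split points into two groups with a Shi-Malik style spectral cut (sketch)."""
    points = np.asarray(points, dtype=float)
    n = len(points)

    # Geometric graph: complete graph with Gaussian weights (an assumed choice).
    diffs = points[:, None, :] - points[None, :, :]
    weights = np.exp(-np.sum(diffs ** 2, axis=-1) / (2.0 * sigma ** 2))
    np.fill_diagonal(weights, 0.0)

    # Symmetric normalized Laplacian I - D^{-1/2} W D^{-1/2}.
    degrees = weights.sum(axis=1)
    inv_sqrt_deg = 1.0 / np.sqrt(degrees)
    laplacian = np.eye(n) - inv_sqrt_deg[:, None] * weights * inv_sqrt_deg[None, :]

    # eigh returns eigenvalues in ascending order; column 1 belongs to the
    # second-smallest one. Rescaling by D^{-1/2} gives the solution of the
    # generalized problem (D - W) y = lambda * D y.
    _, eigvecs = np.linalg.eigh(laplacian)
    embedding = inv_sqrt_deg * eigvecs[:, 1]

    # Sweep over thresholds of the embedding and keep the split with the
    # smallest Normalized Cut: cut/vol(A) + cut/vol(B).
    order = np.argsort(embedding)
    in_set = np.zeros(n, dtype=bool)
    total_volume = degrees.sum()
    cut = 0.0
    volume = 0.0
    best_score, best_size = np.inf, 1
    for size, v in enumerate(order[:-1], start=1):
        # Moving v into the set: its edges to the outside join the cut,
        # its edges to the set leave the cut.
        cut += weights[v, ~in_set].sum() - weights[v, in_set].sum()
        in_set[v] = True
        volume += degrees[v]
        score = cut / volume + cut / (total_volume - volume)
        if score < best_score:
            best_score, best_size = score, size

    labels = np.zeros(n, dtype=bool)
    labels[order[:best_size]] = True
    return labels
```

Calling `two_way_spectral_cut(points)` on an `(n, d)` array returns a boolean label per point. The dense `n x n` weight matrix keeps the sketch short but costs quadratic memory, so it is only suitable for small inputs, not the large-data regime the abstract studies.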

Inverse-degree Sampling for Spectral Clustering

Method to Determine the NIS Based on SVD and Clustering

A Unified Framework for Representation-Based Subspace Clustering of Out-of-Sample and Large-Scale Data

Discriminative Nonnegative Spectral Clustering with Out-of-Sample Extension

Spectral Embedded Clustering: A Framework for In-Sample and Out-of-Sample Spectral Clustering

Subspace Clustering by Directly Solving Discriminative K-means

Sampling Fuzzy K-Means Clustering Algorithm Based on Clonal Optimization

Scalable Spectral Clustering Using Random Binning Features

A Restarted Large-Scale Spectral Clustering with Self-Guiding and Block Diagonal Representation

Randomized Spectral Clustering in Large-Scale Stochastic Block Models

Fast Spectral Clustering with Landmark-Based Subspace Iteration

Spectral clustering on spherical coordinates under the degree-corrected stochastic blockmodel

Spectral clustering with linear embedding: A discrete clustering method for large-scale data

Unified Spectral Clustering with Optimal Graph

Spectral Clustering on Large Datasets: When Does it Work? Theory from Continuous Clustering and Density Cheeger-Buser

A Novel and Effective Method to Directly Solve Spectral Clustering

Discretize Relaxed Solution of Spectral Clustering via a Non-Heuristic Algorithm

Automatic Determination of Intrinsic Cluster Number Family in Spectral Clustering Using Random Walk on Graph

Iterative Subsampling in Solution Path Clustering of Noisy Big Data

Outlier Cluster Formation in Spectral Clustering

A Convex Formulation for Spectral Shrunk Clustering