Abstract: Graph-based semi-supervised learning is the problem of learning a labeling function for the graph nodes given a few example nodes, often called seeds, usually under the assumption that the graph's edges indicate similarity of labels. This is closely related to the local graph clustering or community detection problem of finding a cluster or community of nodes around a given seed. For this problem, we propose a novel generalization of random walk, diffusion, or smooth function methods in the literature to a convex p-norm cut function. Our p-norm methods are motivated by a finding from our study of existing methods: principled methods based on eigenvectors, spectral techniques, random walks, or linear systems often have difficulty capturing the correct boundary of a target label or target cluster. In contrast, 1-norm or maxflow-mincut based methods capture the boundary but cannot grow from a small seed set; hybrid procedures that use both have many hard-to-set parameters. In this paper, we propose a generalization of the objective function behind these methods involving p-norms. To solve the p-norm cut problem we give a strongly local algorithm -- one whose runtime depends on the size of the output rather than the size of the graph. Our method can be thought of as a nonlinear generalization of the Andersen-Chung-Lang push procedure to approximate a personalized PageRank vector efficiently. Our procedure is general and can solve other types of nonlinear objective functions, such as p-norm variants of Huber losses. We provide a theoretical analysis of finding planted target clusters with our method and show that the p-norm cut functions improve on the standard Cheeger inequalities for random walk and spectral methods. Finally, we demonstrate the speed and accuracy of our new method on synthetic and real-world datasets. Our code is available at this http URL.
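
For orientation, the classical push procedure that the abstract describes as the starting point can be sketched as follows. This is a minimal illustration of the standard (linear, 2-norm) Andersen-Chung-Lang push for approximate personalized PageRank, not the paper's p-norm algorithm. The function name `acl_push`, the adjacency-dictionary input, and the default values of `alpha` and `eps` are assumptions made for this example.

```python
from collections import deque


def acl_push(adj, seed, alpha=0.15, eps=1e-4):
    """Approximate personalized PageRank by local push (illustrative sketch).

    adj  : dict mapping node -> list of neighbors (undirected, unweighted,
           no self-loops, every node has degree >= 1; all assumed here)
    seed : starting node
    alpha: teleportation probability
    eps  : tolerance; stop when every residual r[u] < eps * degree(u)
    Returns (p, r): the approximate PageRank vector and the residual vector,
    both as sparse dicts holding only the nodes that were touched.
    """
    p = {}               # approximate solution, nonzero only near the seed
    r = {seed: 1.0}      # residual mass still to be distributed
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        deg_u = len(adj[u])
        ru = r.get(u, 0.0)
        if ru < eps * deg_u:
            continue     # stale queue entry, nothing to push
        # Push: keep an alpha fraction at u, hold half of the remainder
        # at u (lazy walk), and spread the other half over the neighbors.
        p[u] = p.get(u, 0.0) + alpha * ru
        r[u] = (1 - alpha) * ru / 2
        share = (1 - alpha) * ru / (2 * deg_u)
        for v in adj[u]:
            old = r.get(v, 0.0)
            r[v] = old + share
            # Enqueue v only when it newly crosses its threshold.
            if old < eps * len(adj[v]) <= r[v]:
                queue.append(v)
        if r[u] >= eps * deg_u:
            queue.append(u)
    return p, r


# Two triangles joined by the edge (2, 3); the seed sits in the first one.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
p, r = acl_push(adj, seed=0)
```

Only nodes that receive residual mass are ever stored or visited, which is what makes the work depend on the output rather than on the size of the graph. Each push moves at least `alpha * eps * degree(u)` of mass into `p`, so the loop terminates.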

Semidefinite programming on population clustering: a local analysis

Debiasing and a local analysis for population clustering using semidefinite programming

Semidefinite programming on population clustering: a global analysis

Clustering Mixtures with Almost Optimal Separation in Polynomial Time

Sketching semidefinite programs for faster clustering

Clustering Mixtures of Bounded Covariance Distributions Under Optimal Separation

Clustering populations by mixed linear models

Efficient Semidefinite Spectral Clustering Via Lagrange Duality.

Statistically Optimal K-means Clustering via Nonnegative Low-rank Semidefinite Programming

Global Optimization for Cardinality-constrained Minimum Sum-of-Squares Clustering via Semidefinite Programming

Gap-Free Clustering: Sensitivity and Robustness of SDP

On spiked eigenvalues of a renormalized sample covariance matrix from multi-population

Analysis of spectral clustering algorithms for community detection: the general bipartite setting

Performance of a community detection algorithm based on semidefinite programming

Stable Cluster Discrimination for Deep Clustering

Spectral Clustering for Discrete Distributions

Spectral Clustering on Large Datasets: When Does it Work? Theory from Continuous Clustering and Density Cheeger-Buser

Fully Scalable MPC Algorithms for Clustering in High Dimension

Scalable Sparse Subspace Clustering by Orthogonal Matching Pursuit

Strongly Local P-Norm-cut Algorithms for Semi-Supervised Learning and Local Graph Clustering

When Do Birds of a Feather Flock Together? K-Means, Proximity, and Conic Programming.