Bayesian Nonparametric Graph Clustering

Sayantan Banerjee, Rehan Akbani, Veerabhadran Baladandayuthapani
DOI: https://doi.org/10.48550/arXiv.1509.07535
2015-09-25
Abstract: We present clustering methods for multivariate data exploiting the underlying geometry of the graphical structure between variables. As opposed to standard approaches that assume known graph structures, we first estimate the edge structure of the unknown graph using Bayesian neighborhood selection approaches, wherein we account for the uncertainty of graphical structure learning through model-averaged estimates of the suitable parameters. Subsequently, we develop a nonparametric graph clustering model on the lower dimensional projections of the graph based on Laplacian embeddings using Dirichlet process mixture models. In contrast to standard algorithmic approaches, this fully probabilistic approach allows incorporation of uncertainty in estimation and inference for both graph structure learning and clustering. More importantly, we formalize the arguments for Laplacian embeddings as suitable projections for graph clustering by providing theoretical support for the consistency of the eigenspace of the estimated graph Laplacians. We develop fast computational algorithms that allow our methods to scale to a large number of nodes. Through extensive simulations we compare our clustering performance with standard clustering methods. We apply our methods to a novel pan-cancer proteomic data set, and evaluate protein networks and clusters across multiple different cancer types.
Methodology, Applications
What problem does this paper attempt to address?
This paper addresses how to cluster multivariate data by exploiting the graphical structure among the variables, in settings where the data can be modeled as a graph or network whose structure is unknown. The proposed method first estimates the edge structure of the unknown graph and then clusters low-dimensional projections of that graph, obtained from graph Laplacian embeddings, using a nonparametric graph clustering model. Unlike standard algorithmic approaches, this fully probabilistic approach explicitly incorporates uncertainty in both graph structure learning and clustering. The paper also gives theoretical support for Laplacian embeddings as suitable projections for graph clustering by establishing the consistency of the eigenspace of the estimated graph Laplacians. The main contributions of the paper are as follows:
1. **Graph Structure Learning**: The paper uses Bayesian neighborhood selection to estimate the edge structure of the unknown graph. Model-averaged estimates of the relevant parameters account for the uncertainty in graph structure learning.
2. **Nonparametric Graph Clustering**: Dirichlet process mixture models (DPMMs) cluster the low-dimensional projections given by the graph Laplacian embedding. This does not require pre-specifying the number of clusters and can handle clusters of different sizes.
3. **Theoretical Support**: The paper provides a theoretical basis for Laplacian embeddings as suitable projections for graph clustering and proves the consistency of the estimated graph Laplacian eigenspace, which justifies using the estimated graph Laplacian for graph clustering.
4. **Efficient Algorithm**: Fast computational algorithms allow the method to scale to graphs with a large number of nodes.
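To make the Laplacian embedding step concrete, the following is a minimal sketch under stated assumptions, not the paper's implementation. It assumes the graph is already known, uses the unnormalized Laplacian L = D - A (the paper may use a different normalization), and replaces the DPMM with a simple sign split on the second eigenvector. The function names and the toy two-clique graph are hypothetical and chosen only for illustration.

```python
import numpy as np


def laplacian_embedding(adjacency, n_components):
    """Embed nodes using eigenvectors of the unnormalized Laplacian L = D - A.

    Assumption: the graph is undirected, so `adjacency` is symmetric.
    """
    A = np.asarray(adjacency, dtype=float)
    degrees = A.sum(axis=1)
    L = np.diag(degrees) - A
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(L)
    # Skip the first eigenvector: for a connected graph it is constant
    # (eigenvalue 0) and carries no cluster information.
    keep = slice(1, n_components + 1)
    return eigenvalues[keep], eigenvectors[:, keep]


def two_clique_graph(block_size):
    """Hypothetical toy graph: two cliques joined by one bridging edge."""
    n = 2 * block_size
    A = np.zeros((n, n))
    A[:block_size, :block_size] = 1.0
    A[block_size:, block_size:] = 1.0
    np.fill_diagonal(A, 0.0)
    # Single edge between the last node of clique 1 and the first of clique 2.
    A[block_size - 1, block_size] = 1.0
    A[block_size, block_size - 1] = 1.0
    return A


A = two_clique_graph(5)
eigenvalues, embedding = laplacian_embedding(A, n_components=1)

# Stand-in for the mixture model: split nodes by the sign of the
# second eigenvector (the Fiedler vector). The overall sign is arbitrary,
# so only the grouping is meaningful, not which group is labeled 0 or 1.
labels = (embedding[:, 0] > 0).astype(int)
print(labels)
```

In the pipeline described above, the rows of `embedding` would instead be passed to a Dirichlet process mixture model, which infers the number of clusters rather than fixing it at two. One off-the-shelf approximation is scikit-learn's `BayesianGaussianMixture` with `weight_concentration_prior_type="dirichlet_process"`, which fits a truncated variational approximation and is not necessarily the inference scheme used in the paper.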
Through extensive simulation experiments and applications to real data sets, the paper demonstrates the method's performance advantages, especially on large-scale data sets and complex network structures. The method is applied to a new pan-cancer proteomics data set to evaluate protein networks and clusters across cancer types, revealing biologically driven clusters common to multiple cancers as well as differential clusters specific to individual cancers.