A parallelizable model-based approach for marginal and multivariate clustering

Miguel de Carvalho, Gabriel Martos Venturini, Andrej Svetlošák
DOI: https://doi.org/10.48550/arXiv.2212.04009
2022-12-08
Abstract: This paper develops a clustering method that takes advantage of the sturdiness of model-based clustering, while attempting to mitigate some of its pitfalls. First, we note that standard model-based clustering likely leads to the same number of clusters per margin, which seems a rather artificial assumption for a variety of datasets. We tackle this issue by specifying a finite mixture model per margin that allows each margin to have a different number of clusters, and then cluster the multivariate data using a strategy game-inspired algorithm which we call Reign-and-Conquer. Second, since the proposed clustering approach only specifies a model for the margins -- but leaves the joint unspecified -- it has the advantage of being partially parallelizable; hence, the proposed approach is computationally appealing as well as more tractable for moderate to high dimensions than a 'full' (joint) model-based clustering approach. A battery of numerical experiments on artificial data indicates an overall good performance of the proposed methods in a variety of scenarios, and real datasets are used to showcase their application in practice.
Machine Learning, Methodology
What problem does this paper attempt to address?
This paper addresses two main problems:

1. **The "Single K Problem"**:
   - Standard model-based clustering methods tend to yield the same number of clusters in every margin (i.e., in each dimension of the data). This is unnatural in many applications, because different dimensions may differ in structure and complexity, so each margin should be allowed its own number of clusters.
   - The paper addresses this by specifying a separate finite mixture model for each margin, which allows each margin to have a different number of clusters.
2. **The Curse of Dimensionality**:
   - Model-based clustering of high-dimensional data usually requires estimating a large number of parameters, which limits its applicability. Specifically, for \(d\)-dimensional data, the number of parameters in a Gaussian mixture model with unrestricted covariance matrices grows with the square of \(d\), that is, \(O(d^2)\).
   - The paper proposes a partially parallelizable approach. Because it models only the margins and leaves the joint distribution unspecified, it avoids estimating joint covariance matrices. This reduces the number of parameters to estimate and makes the approach more tractable for moderate- to high-dimensional data.

### Main Contributions

- **Solving the Single K Problem**: The paper addresses the Single K problem by specifying a finite mixture model for each margin and then clustering the multivariate data with a strategy-game-inspired algorithm called "Reign-and-Conquer".
- **Partially Parallelized Model**: The proposed approach is partially parallelizable and avoids estimating a large number of covariance parameters in high-dimensional data, which improves computational efficiency.
- **Automatically Screening Low-Density Areas**: By setting a minimum entry requirement (the sieve size), the method automatically screens out regions that carry only a small amount of mass, which refines the clustering results.
- **Game-Theoretical Perspective**: The paper also reinterprets the partitioning method from a game-theoretical perspective and proposes a variant based on Nash equilibrium, although in practice a simpler computational method is mainly used.

### Structure and Organization

- **Section 2**: Introduces the probabilistic framework for partitioning the sample space, including the modeling of the marginal distributions and the generation of initial partitions.
- **Section 3**: Presents the steps of the Reign-and-Conquer clustering algorithm.
- **Section 4**: Reinterprets the partitioning method from a game-theoretical perspective and proposes a variant based on Nash equilibrium.
- **Section 5**: Evaluates the performance of the method through experiments on artificial data.
- **Section 6**: Demonstrates the application of the method on real data.
- **Section 7**: Summarizes the work and discusses directions for future research.

### Numerical Experiments

- **Scenario 1**: The data come from a mixture of three bivariate normal distributions. The results show that clustering performance improves as the sample size increases.
- **Scenario 2**: The data come from a mixture of three Clayton copulas with different marginal distributions. The results show that the method clusters effectively even when the data are non-Gaussian.
- **Scenario 3**: The data come from moderate- and high-dimensional multivariate normal distributions. The results show that, as the dimension increases, the method outperforms the traditional Gaussian mixture model (GMM).
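
The dimension effect in Scenario 3 relates to the \(O(d^2)\) parameter growth noted earlier, and a small count makes it concrete. The sketch below is an illustration, not code from the paper: it assumes Gaussian components, unrestricted covariance matrices for the joint model, and univariate Gaussian mixtures for the margins. The function names are hypothetical.

```python
from typing import Sequence


def gmm_param_count(d: int, k: int) -> int:
    """Free parameters of a k-component, d-dimensional Gaussian mixture
    with unrestricted (full) covariance matrices."""
    weights = k - 1                      # mixing weights sum to one
    means = k * d                        # one d-vector per component
    covariances = k * d * (d + 1) // 2   # symmetric d x d matrix per component
    return weights + means + covariances


def marginal_mixtures_param_count(ks: Sequence[int]) -> int:
    """Free parameters when margin j gets its own univariate Gaussian
    mixture with ks[j] components (assumption: Gaussian margins)."""
    # Per margin: (k_j - 1) weights + k_j means + k_j variances.
    return sum(3 * k_j - 1 for k_j in ks)


if __name__ == "__main__":
    for d in (2, 10, 50):
        joint = gmm_param_count(d, k=3)
        margins = marginal_mixtures_param_count([3] * d)
        print(f"d={d:>2}  joint GMM: {joint:>5}  per-margin mixtures: {margins:>4}")
```

With three components throughout, the joint count is 197 against 80 at \(d = 10\), and 3977 against 400 at \(d = 50\): the joint model grows quadratically in \(d\), while the per-margin total grows linearly.
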
Overall, the paper proposes a new model-based clustering method that addresses the Single K problem and mitigates the curse of dimensionality that affects traditional joint approaches. According to the reported experiments, it performs well overall across a variety of scenarios while being computationally appealing.
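
As a rough illustration of the per-margin idea summarized above, the sketch below cross-tabulates cluster labels obtained separately in each margin and drops cells whose share of the data falls below a minimum-mass threshold. This is not the paper's Reign-and-Conquer algorithm, whose steps are not reproduced in this summary. It only shows how marginal labels with different numbers of clusters induce a product partition and how a sieve-like threshold would screen out sparsely populated cells. The function name, the toy labels, and the threshold value are all assumptions.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def sieve_product_cells(
    marginal_labels: Sequence[Sequence[int]],
    sieve_size: float,
) -> Dict[Tuple[int, ...], float]:
    """Return the product-partition cells whose empirical mass is at least
    sieve_size, mapped to that mass.

    marginal_labels holds one label sequence per margin, all of length n;
    the labels are assumed to come from separate per-margin clusterings.
    """
    n = len(marginal_labels[0])
    if any(len(labels) != n for labels in marginal_labels):
        raise ValueError("all margins must label the same n observations")
    # Each observation falls in the cell given by its tuple of marginal labels.
    cells = Counter(zip(*marginal_labels))
    return {
        cell: count / n
        for cell, count in cells.items()
        if count / n >= sieve_size
    }


if __name__ == "__main__":
    # Toy labels: two clusters in the first margin, three in the second.
    x_labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
    y_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
    kept = sieve_product_cells([x_labels, y_labels], sieve_size=0.15)
    for cell, mass in sorted(kept.items()):
        print(cell, round(mass, 2))
```

In this toy example, five of the six possible cells are occupied, and the two cells that hold a single observation each fall below the threshold and are dropped.
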