Finite Mixtures of Multivariate Poisson-Log Normal Factor Analyzers for Clustering Count Data

Andrea Payne, Anjali Silva, Steven J. Rothstein, Paul D. McNicholas, Sanjeena Subedi
2023-11-14
Abstract: A mixture of multivariate Poisson-log normal factor analyzers is introduced by imposing constraints on the covariance matrix, which results in flexible models for clustering purposes. In particular, a class of eight parsimonious mixture models based on the mixtures of factor analyzers model is introduced. Variational Gaussian approximation is used for parameter estimation, and information criteria are used for model selection. The proposed models are explored in the context of clustering discrete data arising from RNA sequencing studies. Using real and simulated data, the models are shown to give favourable clustering performance. The GitHub R package for this work is available at
Machine Learning, Computation
What problem does this paper attempt to address?
This paper develops a mixture of factor analyzers based on the multivariate Poisson-log normal (MPLN) distribution for clustering count data. Specifically, the authors introduce a class of eight parsimonious mixtures of MPLN factor analyzers. Constraining the covariance matrix reduces the number of free parameters while keeping the model structure flexible, which improves the applicability and efficiency of the models.

The models target discrete data from RNA sequencing (RNA-seq) studies: they handle over-dispersion and accommodate both positive and negative correlations. This addresses a limitation of traditional approaches built on univariate distributions, such as the negative binomial, which treat the variables as independent and therefore cannot capture correlations between them. Using the MPLN distribution lets the authors model this dependence structure in RNA-seq data and, according to the paper, obtain more accurate clustering results.

Parameters are estimated with a variational Gaussian approximation, and information criteria are used for model selection. On both real and simulated data, the proposed models show favourable clustering performance, which makes them relevant to bioinformatics analyses of RNA-seq data.
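To make the model structure concrete, the following is a minimal simulation sketch, not the authors' implementation (their code is an R package). It assumes the usual two-layer construction: a latent Gaussian vector of log-rates whose component covariance has the factor-analyzer form Sigma_g = Lambda_g Lambda_g^T + Psi_g, and conditionally independent Poisson counts given those log-rates. The function names, dimensions, and parameter values are hypothetical, and any library-size normalization term is omitted. The example also assumes that the eight parsimonious models come from the standard three binary constraints of the parsimonious Gaussian mixture family (loadings shared across components or not, Psi shared or not, Psi isotropic or not); the settings below correspond to the most constrained case.

```python
import numpy as np


def factor_covariance(loadings, psi_diag):
    """Return Sigma = Lambda Lambda^T + Psi for a d x q loading matrix
    and a length-d vector of diagonal noise variances."""
    return loadings @ loadings.T + np.diag(psi_diag)


def simulate_mpln_fa_mixture(n, weights, means, loadings, psi_diags, rng):
    """Draw n count vectors from a mixture of MPLN factor analyzers.

    Hypothetical helper for illustration; no library-size offset is used.
    """
    weights = np.asarray(weights, dtype=float)
    labels = rng.choice(len(weights), size=n, p=weights)
    d = means[0].shape[0]
    counts = np.empty((n, d), dtype=np.int64)
    for g in range(len(weights)):
        idx = np.flatnonzero(labels == g)
        sigma = factor_covariance(loadings[g], psi_diags[g])
        # Latent layer: Gaussian log-rates for the observations in component g.
        log_rates = rng.multivariate_normal(means[g], sigma, size=idx.size)
        # Observed layer: Poisson counts, independent given the log-rates.
        counts[idx] = rng.poisson(np.exp(log_rates))
    return counts, labels


rng = np.random.default_rng(0)
d, q = 4, 2

# Assumed example values. The loading matrix is shared by both components,
# and Psi is shared and isotropic (the most constrained setting).
shared_loadings = np.array([[0.6, 0.0],
                            [0.6, 0.0],
                            [-0.6, 0.3],
                            [0.0, 0.5]])
means = [np.full(d, 2.0), np.full(d, 4.0)]
loadings = [shared_loadings, shared_loadings]
psi_diags = [np.full(d, 0.2), np.full(d, 0.2)]

counts, labels = simulate_mpln_fa_mixture(
    n=2000, weights=[0.5, 0.5], means=means,
    loadings=loadings, psi_diags=psi_diags, rng=rng,
)

# Within one component, compare per-variable mean and variance
# (over-dispersion) and inspect the correlation between two variables.
comp0 = counts[labels == 0]
print("mean:", comp0.mean(axis=0))
print("variance:", comp0.var(axis=0))
print("corr(Y1, Y3):", np.corrcoef(comp0[:, 0], comp0[:, 2])[0, 1])
```

In this sketch the first and third variables load on the first factor with opposite signs, so their latent covariance is negative and the simulated counts inherit a negative correlation, something a model with independent variables cannot represent. The log-normal layer also inflates the variance of each count above its mean, which is the over-dispersion mentioned above.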