Abstract: A common approach to analyzing a covariate-sample count matrix, each element of which records how many times a covariate appears in a sample, is to factorize it under the Poisson likelihood. We show its limitation in capturing the tendency for a covariate present in a sample to both repeat itself and excite related ones. To address this limitation, we construct negative binomial factor analysis (NBFA) to factorize the matrix under the negative binomial likelihood, and relate it to a mixed-membership model based on the Dirichlet-multinomial distribution. To support countably infinite factors, we propose the hierarchical gamma-negative binomial process. By exploiting newly proved connections between discrete distributions, we construct two blocked Gibbs samplers and one collapsed Gibbs sampler, all of which adaptively truncate their number of factors, and demonstrate that the blocked Gibbs sampler developed under a compound Poisson representation converges fast and has low computational complexity. Example results show that NBFA has a distinct mechanism for adjusting its number of inferred factors according to the sample lengths, and provides clear advantages in parsimonious representation, predictive power, and computational complexity over previously proposed discrete latent variable models, which either completely ignore burstiness or model only the burstiness of the covariates but not that of the factors.

What problem does this paper attempt to address?

The problem this paper addresses is that existing methods, such as Poisson factor analysis (PFA) and mixed-membership models, are limited in capturing the self-excitation and cross-excitation of covariates when analyzing a covariate-sample count matrix. Specifically:

1. **Limitations of existing methods**:
   - **Poisson factor analysis (PFA)**: It assumes that the variance of each covariate-sample count equals its mean, so it can underestimate the variability of overdispersed counts.
   - **Mixed-membership models**: Each index is assigned independently to a covariate and a sub-group, so these models may not fully capture the tendency for one index to excite other indices in the same sample to select the same or related covariates.
2. **Practical problems**:
   - In practice, covariate-sample counts are often highly overdispersed because covariate frequencies are self-exciting and cross-exciting. For example, in natural language processing, some words appear especially often in a document and may also excite the frequent appearance of related words. This phenomenon is called word burstiness.
3. **The method proposed in the paper**:
   - Introduce negative binomial factor analysis (NBFA), which factorizes the covariate-sample count matrix by replacing the Poisson distribution in PFA with the negative binomial distribution.
   - With the negative binomial distribution, NBFA can better model overdispersed counts and capture burstiness at both the covariate level and the factor level.
4. **Specific objectives**:
   - **Improve model performance**: Provide a more parsimonious representation, stronger predictive power, and lower computational complexity.
   - **Reduce computational waste**: Avoid spending model capacity on self-excitation and cross-excitation that the likelihood itself can explain simply.

In summary, this paper aims to overcome the deficiencies of existing methods in handling highly overdispersed counts by introducing negative binomial factor analysis (NBFA), thereby improving both the performance and the efficiency of the model.
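
The contrast between the two likelihoods can be illustrated with a minimal simulation. This is only a sketch under stated assumptions, not the paper's model or sampler: it compares scalar draws rather than a factorized matrix, it uses the standard gamma-Poisson mixture representation of the negative binomial distribution, and the function name `dispersion_demo` and its parameters (`mean`, `p`, `n_draws`, `seed`) are hypothetical choices made for illustration.

```python
import numpy as np


def dispersion_demo(mean=5.0, p=0.5, n_draws=200_000, seed=0):
    """Compare the variance-to-mean ratio of Poisson and negative binomial draws.

    Illustrative assumption: the negative binomial is simulated as a
    gamma-Poisson mixture, with shape r chosen so both distributions
    share the same mean.
    """
    rng = np.random.default_rng(seed)

    # Poisson counts: the variance equals the mean by construction.
    poisson_draws = rng.poisson(mean, size=n_draws)

    # Negative binomial counts via a gamma-distributed Poisson rate:
    # lam ~ Gamma(shape=r, scale=p / (1 - p)), count ~ Poisson(lam).
    # The mean is r * p / (1 - p), so r is set to match `mean`.
    r = mean * (1.0 - p) / p
    lam = rng.gamma(shape=r, scale=p / (1.0 - p), size=n_draws)
    nb_draws = rng.poisson(lam)

    def variance_to_mean(x):
        return x.var() / x.mean()

    return variance_to_mean(poisson_draws), variance_to_mean(nb_draws)


poisson_vmr, nb_vmr = dispersion_demo()
# The Poisson ratio should be close to 1; the negative binomial ratio
# should be close to 1 / (1 - p), which is 2 for p = 0.5.
print(f"Poisson variance/mean: {poisson_vmr:.2f}")
print(f"Negative binomial variance/mean: {nb_vmr:.2f}")
```

Under this parameterization the negative binomial variance exceeds its mean by the factor 1 / (1 - p), which is the extra dispersion a Poisson likelihood cannot represent without changing its mean.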

Nonparametric Bayesian Negative Binomial Factor Analysis
