Nonparametric Bayesian Negative Binomial Factor Analysis

Mingyuan Zhou
DOI: https://doi.org/10.48550/arXiv.1604.07464
2017-10-05
Abstract: A common approach to analyze a covariate-sample count matrix, an element of which represents how many times a covariate appears in a sample, is to factorize it under the Poisson likelihood. We show its limitation in capturing the tendency for a covariate present in a sample to both repeat itself and excite related ones. To address this limitation, we construct negative binomial factor analysis (NBFA) to factorize the matrix under the negative binomial likelihood, and relate it to a Dirichlet-multinomial distribution based mixed-membership model. To support countably infinite factors, we propose the hierarchical gamma-negative binomial process. By exploiting newly proved connections between discrete distributions, we construct two blocked and a collapsed Gibbs sampler that all adaptively truncate their number of factors, and demonstrate that the blocked Gibbs sampler developed under a compound Poisson representation converges fast and has low computational complexity. Example results show that NBFA has a distinct mechanism in adjusting its number of inferred factors according to the sample lengths, and provides clear advantages in parsimonious representation, predictive power, and computational complexity over previously proposed discrete latent variable models, which either completely ignore burstiness, or model only the burstiness of the covariates but not that of the factors.
Methodology, Machine Learning
What problem does this paper attempt to address?
The paper addresses the limitations of existing methods, such as Poisson factor analysis (PFA) and mixed-membership models, in capturing the self-excitation and cross-excitation of covariates when analyzing a covariate-sample count matrix. Specifically:

1. **Limitations of existing methods**:
   - **Poisson factor analysis (PFA)**: It assumes that the variance of each covariate-sample count equals its mean, so it can underestimate the variability of over-dispersed counts.
   - **Mixed-membership models**: Each index is independently assigned to a covariate and a sub-group, so these models may not fully capture the tendency of one index to excite other indices in the same sample to select the same or related covariates.
2. **Practical problems**:
   - In practice, highly over-dispersed covariate-sample counts often arise from the self-excitation and cross-excitation of covariate frequencies. For example, in natural language processing, some words appear especially often in a document and may also excite the frequent appearance of related words. This phenomenon is called word burstiness.
3. **The method proposed in the paper**:
   - Negative binomial factor analysis (NBFA) factorizes the covariate-sample count matrix by replacing the Poisson distribution in PFA with the negative binomial distribution.
   - With the negative binomial distribution, NBFA can better model over-dispersed counts and capture burstiness at both the covariate and factor levels.
4. **Specific objectives**:
   - **Improve model performance**: Provide a more parsimonious representation, stronger predictive power, and lower computational complexity.
   - **Reduce computational waste**: Avoid spending additional model capacity on self-excitation and cross-excitation that can be explained more simply.
In summary, this paper aims to overcome the deficiencies of existing methods in dealing with highly over-dispersed counts by introducing negative binomial factor analysis (NBFA), thereby improving the performance and efficiency of the model in multiple aspects.
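
The contrast between the Poisson and negative binomial likelihoods described above can be illustrated with a small simulation. The sketch below is not the paper's NBFA model or its Gibbs samplers. It only compares single counts drawn from a Poisson distribution and from a negative binomial distribution with the same mean, using the standard gamma-Poisson mixture construction of the negative binomial. The function names, the mean `mu`, the dispersion `r`, and the sample size are all illustrative assumptions. Under this parameterization the negative binomial variance is `mu + mu**2 / r`, so its variance-to-mean ratio should be close to `1 + mu / r`, while the Poisson ratio should be close to 1.

```python
import math
import random
from statistics import mean, pvariance


def sample_poisson(rate, rng):
    """Draw one Poisson(rate) count with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_negative_binomial(r, mu, rng):
    """Draw one negative binomial count with mean mu and dispersion r.

    Uses the gamma-Poisson mixture: a gamma-distributed rate with
    shape r and scale mu / r (so its mean is mu), then a Poisson draw.
    """
    rate = rng.gammavariate(r, mu / r)
    return sample_poisson(rate, rng)


def variance_to_mean(counts):
    """Variance-to-mean ratio; values well above 1 indicate over-dispersion."""
    return pvariance(counts) / mean(counts)


rng = random.Random(0)       # fixed seed for reproducibility
mu, r, n = 5.0, 1.0, 20000   # illustrative values, not taken from the paper

poisson_counts = [sample_poisson(mu, rng) for _ in range(n)]
nb_counts = [sample_negative_binomial(r, mu, rng) for _ in range(n)]

poisson_ratio = variance_to_mean(poisson_counts)
nb_ratio = variance_to_mean(nb_counts)

print(f"Poisson variance/mean:           {poisson_ratio:.2f}")
print(f"Negative binomial variance/mean: {nb_ratio:.2f}")
```

A smaller `r` makes the negative binomial counts burstier at the same mean, which is the kind of over-dispersion a Poisson likelihood cannot absorb without extra model capacity.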