Generalized Grade-of-Membership Estimation for High-dimensional Locally Dependent Data

Ling Chen, Chengzhu Huang, Yuqi Gu
2024-12-28
Abstract: This work focuses on the mixed membership models for multivariate categorical data widely used for analyzing survey responses and population genetics data. These grade of membership (GoM) models offer rich modeling power but present significant estimation challenges for high-dimensional polytomous data. Popular existing approaches, such as Bayesian MCMC inference, are not scalable and lack theoretical guarantees in high-dimensional settings. To address this, we first observe that data from this model can be reformulated as a three-way (quasi-)tensor, with many subjects responding to many items with varying numbers of categories. We introduce a novel and simple approach that flattens the three-way quasi-tensor into a "fat" matrix, and then performs a singular value decomposition of it to estimate parameters by exploiting the singular subspace geometry. Our fast spectral method can accommodate a broad range of data distributions with arbitrarily locally dependent noise, which we formalize as the generalized-GoM models. We establish finite-sample entrywise error bounds for the generalized-GoM model parameters. This is supported by a new sharp two-to-infinity singular subspace perturbation theory for locally dependent and flexibly distributed noise, a contribution of independent interest. Simulations and applications to data in political surveys, population genetics, and single-cell sequencing demonstrate our method's superior performance.
Methodology, Statistics Theory, Machine Learning
What problem does this paper attempt to address?
This paper addresses the estimation of mixed membership models for high-dimensional, locally dependent data, with a focus on multivariate categorical data. Specifically, it targets the following problems:

1. **Limitations of existing methods**:
   - Existing Bayesian MCMC inference methods are computationally expensive for high-dimensional multivariate categorical data and are difficult to scale to large data sets.
   - These methods lack theoretical guarantees in high-dimensional settings, especially in the presence of local dependence.
2. **Complexity of high-dimensional multivariate categorical data**:
   - Such data (for example, survey responses and genomics data) have complex hierarchical structures and local dependencies, so traditional estimation methods are difficult to apply directly.
   - Local dependence means that the observed multivariate responses retain dependence that the latent variables do not fully explain.
3. **Lack of theoretical guarantees**:
   - In high-dimensional settings, rigorous error bounds for parameter estimation in mixed membership models (especially GoM models) have been lacking.

To solve these problems, the authors propose a fast spectral method based on singular value decomposition (SVD). The method flattens the three-way quasi-tensor into a "fat" matrix and then exploits the singular subspace geometry to estimate the parameters. Formalized as the generalized-GoM models, the framework accommodates a broad range of data distributions and arbitrarily locally dependent noise. The authors also develop a new sharp two-to-infinity singular subspace perturbation theory, which supports the method's guarantees under locally dependent and flexibly distributed noise. This theory matters for mixed membership models and is also of independent interest for other latent variable models.

In summary, the paper aims to provide an efficient method with theoretical guarantees for estimating mixed membership models from high-dimensional, locally dependent data, especially multivariate categorical data.
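The flatten-then-SVD idea described in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: the "fat" matrix is built by one-hot encoding each item's categories, the data are simulated with independent responses (no local dependence), and the names `flatten_responses`, `top_singular_subspace`, `Pi`, and `Theta` are hypothetical. The sketch stops at the singular subspace; recovering membership scores from its geometry is not shown.

```python
import numpy as np


def flatten_responses(responses, num_categories):
    """One-hot encode an (N, J) integer response matrix into an (N, sum C_j) matrix.

    responses[i, j] is the 0-based category chosen by subject i on item j.
    Items may have different numbers of categories (the "quasi-tensor" case).
    """
    n_subjects, n_items = responses.shape
    blocks = []
    for j in range(n_items):
        block = np.zeros((n_subjects, num_categories[j]))
        block[np.arange(n_subjects), responses[:, j]] = 1.0
        blocks.append(block)
    return np.hstack(blocks)


def top_singular_subspace(matrix, k):
    """Return the top-k left singular vectors and singular values."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k], s[:k]


rng = np.random.default_rng(0)
N, J, K = 200, 30, 3                          # subjects, items, extreme profiles
num_categories = rng.integers(2, 5, size=J)   # each item has 2 to 4 categories

# Hypothetical ground truth: mixed membership scores and item parameters.
Pi = rng.dirichlet(np.ones(K), size=N)        # (N, K), rows sum to 1
Theta = np.vstack(
    [rng.dirichlet(np.ones(c), size=K).T for c in num_categories]
)                                             # (sum C_j, K), per-item columns sum to 1

# Expected flattened matrix: low rank by construction.
P = Pi @ Theta.T                              # (N, sum C_j)

# Sample one categorical response per subject and item by inverse-CDF sampling.
offsets = np.concatenate(([0], np.cumsum(num_categories)))
responses = np.empty((N, J), dtype=int)
for j in range(J):
    cdf = np.cumsum(P[:, offsets[j]:offsets[j + 1]], axis=1)
    draws = rng.random(N)
    picked = (draws[:, None] > cdf).sum(axis=1)
    # Guard against floating-point round-off in the last CDF entry.
    responses[:, j] = np.minimum(picked, num_categories[j] - 1)

Y = flatten_responses(responses, num_categories)
U_hat, s_hat = top_singular_subspace(Y, K)
print(Y.shape, np.linalg.matrix_rank(P), U_hat.shape)
```

In this simulation the expected matrix `P` has rank `K` even though it has many more columns than `K`, which is why a rank-`K` SVD of the noisy matrix `Y` is a natural starting point.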