Computational Approaches for Exponential-Family Factor Analysis

Liang Wang, Luis Carvalho
2024-03-22
Abstract: We study a general factor analysis framework where the $n$-by-$p$ data matrix is assumed to follow a general exponential family distribution entry-wise. While this model framework has been proposed before, we here further relax its distributional assumption by using a quasi-likelihood setup. By parameterizing the mean-variance relationship on data entries, we additionally introduce a dispersion parameter and entry-wise weights to model large variations and missing values. The resulting model is thus not only robust to distribution misspecification but also more flexible and able to capture non-Gaussian covariance structures of the data matrix. Our main focus is on efficient computational approaches to perform the factor analysis. Previous modeling frameworks rely on simulated maximum likelihood (SML) to find the factorization solution, but this method was shown to lead to asymptotic bias when the simulated sample size grows slower than the square root of the sample size $n$, eliminating its practical application for data matrices with large $n$. Borrowing from expectation-maximization (EM) and stochastic gradient descent (SGD), we investigate three estimation procedures based on iterative factorization updates. Our proposed solution does not show asymptotic biases, and scales even better for large matrix factorizations with error $O(1/p)$. To support our findings, we conduct simulation experiments and discuss applications in three case studies.
Methodology, Computation
What problem does this paper attempt to address?
The paper addresses several key limitations of existing factor analysis models for high-dimensional data:

1. **Overly strict distributional assumptions**: Traditional factor models assume that both the data and the latent variables are Gaussian, which is unsuitable for binary data, count data, or other data with non-constant variance. Some studies have extended factor models to more general exponential family distributions, but these extensions still cannot handle the over-dispersion found in real-world data.
2. **Model identifiability issues**: The latent factors in a factor model are usually identifiable only up to a rotation, which makes them difficult to interpret. The paper proposes orthogonal identifiability constraints to improve the interpretability of the latent factors.
3. **Lack of ability to model missing data**: Existing factor analysis frameworks lack the flexibility to handle missing data effectively, which limits their use in areas such as matrix completion.
4. **Efficiency and robustness of optimization algorithms**: Traditional simulated maximum likelihood (SML) estimation has numerical and theoretical problems on large-scale datasets, such as asymptotic bias and low computational efficiency. The paper proposes efficient and robust optimization algorithms based on expectation-maximization (EM) and stochastic gradient descent (SGD) to address these problems.
To overcome these limitations, the paper proposes a more general exponential family factor model (EFM) with the following improvements:

- **Parameterize the mean-variance relationship of the data**: Introduce column-wise dispersion parameters to model the data covariance, so that the model is robust to misspecification of the distributional assumption, more flexible, and able to capture the covariance structure of non-Gaussian data.
- **Provide interpretable latent factors**: Improve the interpretability of the latent factors through orthogonal identifiability constraints.
- **Propose fast, accurate, and robust optimization algorithms**: Use modern stochastic optimization techniques to build optimization algorithms based on EM and SGD. These algorithms show no asymptotic bias and scale better for large matrix factorizations.
- **Implement an efficient package**: Develop a software package that supports entry-wise weights and covariance modeling, facilitating practical applications.

In summary, the paper aims to make factor analysis models more flexible, efficient, and practical, so that they suit high-dimensional, non-Gaussian data and handle missing data effectively.
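To make the roles of the entry-wise weights and the dispersion parameter concrete, here is a minimal sketch. It is not the paper's EM or SGD procedure: it uses plain full-batch gradient descent, assumes a Poisson-type variance function $\mathrm{Var}(X_{ij}) = \phi\,\mu_{ij}$ with a log link, imposes no identifiability constraints, and estimates a single scalar dispersion by a Pearson-type statistic. The function name `fit_weighted_poisson_factors` and all parameter names are hypothetical.

```python
import numpy as np


def fit_weighted_poisson_factors(X, W, q=2, lr=0.1, n_iter=500, seed=0):
    """Full-batch gradient descent on a weighted Poisson quasi-likelihood.

    Assumed model (illustration only): E[X_ij] = mu_ij = exp(eta_ij) with
    eta = L @ V.T, and Var[X_ij] = phi * mu_ij. W holds entry-wise
    weights, with 0 marking a missing entry.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    L = 0.1 * rng.standard_normal((n, q))  # factor scores, n x q
    V = 0.1 * rng.standard_normal((p, q))  # loadings, p x q
    X0 = np.where(W > 0, X, 0.0)           # neutral fill so NaNs never propagate
    losses = []
    for _ in range(n_iter):
        eta = np.clip(L @ V.T, -20.0, 20.0)  # guard against overflow in exp
        mu = np.exp(eta)
        # Negative weighted quasi-log-likelihood, up to an additive constant.
        losses.append(float(np.sum(W * (mu - X0 * eta))))
        R = W * (X0 - mu)        # weighted residuals; zero where data is missing
        grad_L = R @ V / p       # ascent direction for L, averaged over columns
        grad_V = R.T @ L / n     # ascent direction for V, averaged over rows
        L = L + lr * grad_L
        V = V + lr * grad_V
    mu = np.exp(np.clip(L @ V.T, -20.0, 20.0))
    # Pearson-type dispersion estimate over observed entries.
    dof = np.count_nonzero(W) - q * (n + p)
    phi = float(np.sum(W * (X0 - mu) ** 2 / mu) / dof)
    return L, V, phi, losses


# Simulated example: Poisson counts with roughly 10% of entries missing.
rng = np.random.default_rng(1)
n, p, q = 200, 30, 2
L_true = rng.standard_normal((n, q))
V_true = 0.5 * rng.standard_normal((p, q))
X = rng.poisson(np.exp(L_true @ V_true.T)).astype(float)
W = (rng.random((n, p)) > 0.1).astype(float)
X[W == 0] = np.nan  # missing entries carry weight zero

L_hat, V_hat, phi_hat, losses = fit_weighted_poisson_factors(X, W, q=q)
print(f"loss: {losses[0]:.1f} -> {losses[-1]:.1f}, dispersion: {phi_hat:.2f}")
```

Because a missing entry has weight zero, it contributes nothing to the loss or the gradients, yet the fitted mean `exp(L_hat @ V_hat.T)` still provides a value for it, which is the sense in which such a factorization can be used for matrix completion. The step size and iteration count are arbitrary choices for this toy example and would need tuning on real data.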