Abstract: We study a general factor analysis framework where the $n$-by-$p$ data matrix is assumed to follow a general exponential family distribution entry-wise. While this model framework has been proposed before, we here further relax its distributional assumption by using a quasi-likelihood setup. By parameterizing the mean-variance relationship on data entries, we additionally introduce a dispersion parameter and entry-wise weights to model large variations and missing values. The resulting model is thus not only robust to distribution misspecification but also more flexible and able to capture non-Gaussian covariance structures of the data matrix. Our main focus is on efficient computational approaches to perform the factor analysis. Previous modeling frameworks rely on simulated maximum likelihood (SML) to find the factorization solution, but this method was shown to lead to asymptotic bias when the simulated sample size grows slower than the square root of the sample size $n$, which rules out its practical use for data matrices with large $n$. Borrowing from expectation-maximization (EM) and stochastic gradient descent (SGD), we investigate three estimation procedures based on iterative factorization updates. Our proposed solution does not show asymptotic biases, and scales even better for large matrix factorizations with error $O(1/p)$. To support our findings, we conduct simulation experiments and discuss applications in three case studies.
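To make the setup in the abstract concrete (a mean-variance relationship, entry-wise weights that can mark missing values, and iterative gradient-based factor updates), here is a minimal sketch. It is not the paper's algorithm. The function name `fit_weighted_poisson_factors`, the log link, the Poisson-type variance function $V(\mu) = \mu$, the full-batch updates, and the step sizes are all assumptions chosen for illustration.

```python
import numpy as np

def fit_weighted_poisson_factors(X, W, q, steps=300, lr=0.02, seed=0):
    """Hypothetical helper: gradient descent on a weighted Poisson quasi-likelihood.

    Assumed mean model: mu = exp(Lam @ V.T) (log link, variance function V(mu) = mu).
    W holds entry-wise weights; a weight of 0 marks a missing entry.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Lam = 0.1 * rng.standard_normal((n, q))   # row factors (n x q)
    V = 0.1 * rng.standard_normal((p, q))     # column loadings (p x q)
    X0 = np.where(W > 0, X, 0.0)              # neutralize missing entries (may be NaN)
    history = []
    for _ in range(steps):
        eta = Lam @ V.T
        mu = np.exp(eta)
        # Weighted negative quasi-log-likelihood, up to terms constant in eta.
        history.append(float(np.sum(W * (mu - X0 * eta))))
        R = W * (X0 - mu)                     # weighted quasi-score w.r.t. eta
        Lam_new = Lam + (lr / p) * (R @ V)    # descent step for the row factors
        V = V + (lr / n) * (R.T @ Lam)        # descent step for the loadings
        Lam = Lam_new
    return Lam, V, history
```

Setting a weight to zero removes that entry from both the objective and the gradient, which is how a weighted formulation of this kind can accommodate missing values. A stochastic variant would compute `R` on a random subset of rows at each step instead of the full matrix.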

What problem does this paper attempt to address?

This paper addresses several key limitations of existing factor analysis models for high-dimensional data:

1. **Overly strict distributional assumptions**: Traditional factor models assume that both the data and the latent variables are Gaussian, which is unsuitable for binary data, count data, or other data types with non-constant variance. Some studies have extended factor models to more general exponential family distributions, but these extensions still cannot handle the over-dispersed data found in practice.
2. **Model identifiability issues**: The latent factors in a factor model are usually identified only up to a rotation, which makes them hard to interpret. The paper proposes orthogonal identifiability constraints to improve the interpretability of the latent factors.
3. **Lack of ability to model missing data**: The existing factor analysis framework cannot effectively handle missing data, which limits its use in areas such as matrix completion.
4. **Efficiency and robustness of optimization algorithms**: Simulated maximum likelihood (SML) estimation has numerical and theoretical problems on large-scale datasets, such as asymptotic bias and low computational efficiency. The paper proposes efficient and robust optimization algorithms based on expectation-maximization (EM) and stochastic gradient descent (SGD) to address these problems.
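The rotation issue in point 2 can be checked numerically. The sketch below shows that any orthogonal matrix $Q$ leaves the product of factors and loadings unchanged, and that one common normalization (orthonormal loadings obtained from an SVD, plus a sign convention) picks a single representative. The exact constraints used in the paper are not stated here, so the function `canonicalize` and its sign rule are assumptions for illustration only.

```python
import numpy as np

def canonicalize(Lam, V):
    """Hypothetical helper: return (Lam_c, V_c) with Lam_c @ V_c.T == Lam @ V.T,
    where V_c has orthonormal columns and Lam_c has orthogonal columns."""
    q = Lam.shape[1]
    U, s, Vt = np.linalg.svd(Lam @ V.T, full_matrices=False)
    U, s, Vt = U[:, :q], s[:q], Vt[:q]
    # Sign convention: the largest-magnitude entry of each loading column is positive.
    idx = np.argmax(np.abs(Vt), axis=1)
    signs = np.sign(Vt[np.arange(q), idx])
    signs[signs == 0] = 1.0
    return (U * s) * signs, Vt.T * signs

rng = np.random.default_rng(0)
n, p, q = 50, 8, 3
Lam = rng.standard_normal((n, q))
V = rng.standard_normal((p, q))
Q, _ = np.linalg.qr(rng.standard_normal((q, q)))   # a random orthogonal matrix

# The rotated pair (Lam @ Q, V @ Q) gives the same fitted matrix ...
same_fit = np.allclose(Lam @ V.T, (Lam @ Q) @ (V @ Q).T)

# ... but after canonicalization both pairs map to the same factors.
Lam1, V1 = canonicalize(Lam, V)
Lam2, V2 = canonicalize(Lam @ Q, V @ Q)
```

Because both pairs describe the same rank-$q$ matrix, the normalized factors agree whenever the singular values are distinct, which is what makes the constrained factors comparable across fits.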
To overcome these limitations, the paper proposes a more general exponential family factor model (EFM) with improvements in the following areas:

- **Parameterize the mean-variance relationship of the data**: Introduce column-wise dispersion parameters to model the data covariance, making the model robust to misspecified distributional assumptions, more flexible, and able to capture the covariance structure of non-Gaussian data.
- **Provide interpretability of latent factors**: Improve the interpretability of the latent factors through orthogonal identifiability constraints.
- **Propose fast, accurate, and robust optimization algorithms**: Use modern stochastic optimization techniques to build optimization algorithms based on EM and SGD. These algorithms show no asymptotic bias and perform better on large-scale matrix factorization.
- **Implement an efficient package**: Develop an efficient software package that supports entry-wise weights and covariance modeling, facilitating practical applications.

In summary, the paper aims to make factor analysis models more flexible, efficient, and practical, suitable for high-dimensional non-Gaussian data and able to handle missing data effectively.
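As an illustration of what a column-wise dispersion parameter measures, the sketch below uses the standard Pearson-type moment estimator from quasi-likelihood theory, $\hat\phi_j = \sum_i w_{ij}(x_{ij}-\mu_{ij})^2 / V(\mu_{ij}) \,/\, \sum_i w_{ij}$. Whether the paper estimates dispersion this way is not stated here. The function name `pearson_dispersion`, the default variance function $V(\mu)=\mu$, and the assumption that the fitted means `mu` are already available are all choices made for this example.

```python
import numpy as np

def pearson_dispersion(X, mu, W, variance=lambda m: m):
    """Hypothetical helper: per-column Pearson estimate of the dispersion phi_j,
    assuming Var(x_ij) = phi_j * V(mu_ij). Entries with weight 0 are ignored."""
    # Squared residuals; missing entries (weight 0, possibly NaN) contribute 0.
    resid2 = np.where(W > 0, (X - mu) ** 2, 0.0)
    num = (W * resid2 / variance(mu)).sum(axis=0)
    den = W.sum(axis=0)
    return num / den
```

A value of $\hat\phi_j$ close to 1 is consistent with Poisson-like variance in column $j$, while values well above 1 indicate over-dispersion relative to the assumed variance function.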

Computational Approaches for Exponential-Family Factor Analysis
