High-dimensional Factor Analysis for Network-linked Data

Jinming Li, Gongjun Xu, Ji Zhu
2024-03-26
Abstract: Factor analysis is a widely used statistical tool in many scientific disciplines, such as psychology, economics, and sociology. As observations linked by networks become increasingly common, incorporating network structures into factor analysis remains an open problem. In this paper, we focus on high-dimensional factor analysis involving network-connected observations, and propose a generalized factor model with latent factors that account for both the network structure and the dependence structure among high-dimensional variables. These latent factors can be shared by the high-dimensional variables and the network, or exclusively applied to either of them. We develop a computationally efficient estimation procedure and establish asymptotic inferential theories. Notably, we show that by borrowing information from the network, the proposed estimator of the factor loading matrix achieves optimal asymptotic variance under much milder identifiability constraints than existing literature. Furthermore, we develop a hypothesis testing procedure to tackle the challenge of discerning the shared and individual latent factors' structure. The finite sample performance of the proposed method is demonstrated through simulation studies and a real-world dataset involving a statistician co-authorship network.
Methodology
What problem does this paper attempt to address?
This paper addresses how to conduct factor analysis on high-dimensional data whose observations are linked by a network. Factor analysis is widely used in fields such as psychology, economics, and sociology. However, as network-connected data becomes more common, how to incorporate network structure into factor analysis remains an open problem. The paper proposes a generalized factor model that considers latent factors related to the node variables and also introduces additional latent factors to explain the dependence structure of the network. These latent factors can be shared by the high-dimensional variables and the network, or can apply to only one of them.

### Main problems

1. **Relationship between network structure and high-dimensional variables**:
   - How can network connections and high-dimensional variables be modeled simultaneously in factor analysis?
   - How can network information help the estimator of the factor loading matrix attain the optimal asymptotic variance?
2. **Model identification conditions**:
   - How can the identifiability constraints required by the model be relaxed?
   - How can valid statistical inference be achieved under these more relaxed conditions?
3. **Hypothesis testing**:
   - How can the structures of shared and individual latent factors be distinguished?
   - How can the dimensions of the latent factors be determined, especially the dimensions of \( Z_1 \) and \( Z_2 \)?

### Solutions

1. **Generalized factor model**:
   - A generalized factor model is proposed that handles high-dimensional node variables and the network connections between observations simultaneously.
   - The latent factors in the model are divided into three parts: \( Z_1 \) is related only to the network data, \( Z_3 \) is related only to the node variables, and \( Z_2 \) is shared by both.
2. **Estimation method**:
   - A computationally efficient two-step estimation procedure is developed: first, estimate \( \hat{Z}_{12} \) from the network data; then, regress the high-dimensional variables onto \( \hat{Z}_{12} \) to estimate the factor loadings.
   - The asymptotic distribution of the factor loading matrix estimator is established, showing that the optimal asymptotic variance can be achieved under weaker identification conditions.
3. **Hypothesis testing**:
   - A hypothesis testing procedure is proposed to determine the structures of shared and individual latent factors.
   - The dimension of \( Z_3 \) is estimated by a sequential testing method, and \( Z_1 \) and \( Z_2 \) are distinguished by multiple hypothesis testing.

### Main contributions

1. **Model aspect**:
   - The proposed generalized factor model incorporates network information into the factor analysis of high-dimensional variables, relaxing the constraints required for effective statistical inference.
2. **Statistical inference aspect**:
   - New hypothesis testing procedures are proposed for inferring shared and individual latent factors.
   - Asymptotic distribution results are established for the latent factor and factor loading estimators, showing that the optimal asymptotic variance can be achieved under weaker identification conditions.
3. **Statistical learning aspect**:
   - The flexibility of the model allows latent factors to be shared between the high-dimensional variables and the network or exclusive to one of them, which shows better performance in downstream tasks.

Together, these methods provide a comprehensive framework for high-dimensional factor analysis of network-connected data.
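The two-step idea described under the estimation method (estimate \( \hat{Z}_{12} \) from the network, then regress the node variables onto it) can be illustrated with a minimal sketch. This is not the paper's estimator: the summary does not specify how \( \hat{Z}_{12} \) is obtained, so the sketch substitutes a plain spectral embedding of the adjacency matrix and ordinary least squares without intercepts. The function name `two_step_sketch`, the argument `k12` (the assumed combined dimension of \( Z_1 \) and \( Z_2 \)), and the returned residual matrix are all hypothetical choices made for illustration.

```python
import numpy as np


def two_step_sketch(A, X, k12):
    """Illustrative two-step fit (assumptions: spectral embedding + OLS).

    A   : (n, n) symmetric network matrix between observations.
    X   : (n, p) matrix of node variables.
    k12 : assumed number of network-related latent factors.
    """
    # Step 1 (stand-in for the paper's network estimator):
    # embed the network using the k12 eigenpairs of largest magnitude.
    eigvals, eigvecs = np.linalg.eigh(A)
    top = np.argsort(np.abs(eigvals))[::-1][:k12]
    # Scale eigenvectors by sqrt(|eigenvalue|); taking the absolute value
    # is a simplification for indefinite matrices.
    Z12_hat = eigvecs[:, top] * np.sqrt(np.abs(eigvals[top]))

    # Step 2: least-squares regression of every column of X on Z12_hat.
    coef, *_ = np.linalg.lstsq(Z12_hat, X, rcond=None)  # shape (k12, p)
    loadings_hat = coef.T                                # shape (p, k12)

    # What the network factors do not explain stays in the residual.
    residual = X - Z12_hat @ coef
    return Z12_hat, loadings_hat, residual
```

In this sketch the embedding is identified only up to an orthogonal rotation, so the fitted values `Z12_hat @ loadings_hat.T` are meaningful while the individual columns are not. Any structure driven by a factor that affects only the node variables (the role the summary assigns to \( Z_3 \)) would remain in `residual`.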