Latent class analysis by regularized spectral clustering

Huan Qing
2023-10-28
Abstract: The latent class model is a powerful tool for identifying latent classes within populations that share common characteristics for categorical data in social, psychological, and behavioral sciences. In this article, we propose two new algorithms to estimate a latent class model for categorical data. Our algorithms are developed by using a newly defined regularized Laplacian matrix calculated from the response matrix. We provide theoretical convergence rates of our algorithms by considering a sparsity parameter and show that our algorithms stably yield consistent latent class analysis under mild conditions. Additionally, we propose a metric to capture the strength of latent class analysis and several procedures designed based on this metric to infer how many latent classes one should use for real-world categorical data. The efficiency and accuracy of our algorithms are verified by extensive simulated experiments, and we further apply our algorithms to real-world categorical data with promising results.
Machine Learning
What problem does this paper attempt to address?
This paper addresses several challenges in latent class analysis (LCA) for categorical data. Specifically, the authors propose two new methods based on regularized spectral clustering to estimate the latent class model (LCM). These methods target the following problems:

1. **Influence of sparsity**: Existing methods do not fully account for sparsity in categorical data, which can be described by a parameter in the LCM. The authors introduce a sparsity parameter and analyze its effect on algorithm performance.
2. **Computational efficiency and theoretical guarantees**: Bayesian inference and maximum likelihood estimation (MLE) are computationally expensive, and most methods lack theoretical convergence guarantees. The proposed algorithms are computationally efficient and come with guaranteed convergence rates.
3. **Binary and multi-category responses**: Most existing methods apply only to binary response data and ignore multi-category response data. The proposed methods handle a wider range of categorical data types.
4. **Lack of evaluation criteria**: For real-world categorical data, there are no established criteria for evaluating the quality of a latent class analysis. The authors therefore introduce an evaluation index based on Newman-Girvan modularity and design a corresponding algorithm to determine the number of latent classes.

### Main contributions

- **New methods**: Two new regularized spectral clustering methods are proposed, which use the singular value decomposition of a newly defined regularized Laplacian matrix to estimate the latent class model.
- **Theoretical analysis**: By introducing the sparsity parameter, the authors establish error rates for the algorithms and prove that, under mild conditions on the sparsity of the response matrix, the algorithms consistently estimate the latent classes and the other LCM parameters.
- **Evaluation index**: An index based on Newman-Girvan modularity is proposed to measure the quality of latent class analysis, and a modularity-maximization algorithm is designed to estimate the number of latent classes.

### Method overview

The authors assume that each latent class contains at least one individual and define the regularized Laplacian matrix \(L_{\tau}\). By applying the singular value decomposition (SVD) to \(L_{\tau}\) and combining it with K-means clustering, the classification matrix \(Z\) and the item parameter matrix \(\Theta\) can be estimated. The algorithm proceeds as follows:

1. Calculate the regularized Laplacian matrix \(L_{\tau}\).
2. Compute the top-\(K\) singular value decomposition \(U\Sigma V'\) of \(L_{\tau}\).
3. Run K-means clustering on the rows of \(U\) or \(U^{*}\) to obtain the estimated classification matrix \(\hat{Z}\).
4. Recover the item parameter matrix \(\hat{\Theta}\).

In addition, the authors consider several baseline methods (such as LCA-RSCORS and LCA-PCA) for comparison, and verify the effectiveness and accuracy of the proposed methods through simulation experiments and applications to real data.

### Conclusion

This paper addresses several limitations of existing latent class analysis methods by introducing new regularized spectral clustering methods, and it provides efficient algorithms and evaluation criteria for categorical data analysis.
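
The four algorithm steps listed in the method overview can be illustrated with a short Python sketch. This is not the authors' implementation, and several details are assumptions because this summary does not spell them out: the regularized Laplacian is taken here to be \(L_{\tau} = (D + \tau I)^{-1/2} R\), where \(R\) is the response matrix and \(D\) is the diagonal matrix of its row sums; the default \(\tau\) is set to the average row sum; \(U^{*}\) is interpreted as the row-normalized version of \(U\); and \(\hat{\Theta}\) is recovered as the class-wise mean response per item. The function names `kmeans` and `lca_rsc` are hypothetical.

```python
import numpy as np


def kmeans(X, K, n_iter=100, n_init=10, seed=0):
    """Plain Lloyd's k-means; returns the labels of the best of n_init restarts."""
    rng = np.random.default_rng(seed)
    best_labels, best_cost = None, np.inf
    for _ in range(n_init):
        # Initialize centers from K distinct data points.
        centers = X[rng.choice(len(X), size=K, replace=False)]
        for _ in range(n_iter):
            dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            # Keep the old center if a cluster becomes empty.
            new_centers = np.array([
                X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
                for k in range(K)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        cost = dist[np.arange(len(X)), labels].sum()
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels


def lca_rsc(R, K, tau=None, normalize_rows=False):
    """Sketch of regularized spectral clustering for a latent class model.

    R is an N x J response matrix with nonnegative entries; K is the number
    of latent classes. Returns (Z_hat, Theta_hat).
    """
    R = np.asarray(R, dtype=float)
    degrees = R.sum(axis=1)
    if tau is None:
        # ASSUMPTION: default regularizer = average row sum (guarded against 0).
        tau = max(degrees.mean(), 1e-12)

    # Step 1 (ASSUMED form): L_tau = (D + tau * I)^{-1/2} R.
    L_tau = R / np.sqrt(degrees + tau)[:, None]

    # Step 2: top-K left singular vectors of L_tau.
    U, _, _ = np.linalg.svd(L_tau, full_matrices=False)
    U = U[:, :K]

    # ASSUMPTION: U* is U with each row scaled to unit length.
    if normalize_rows:
        norms = np.linalg.norm(U, axis=1, keepdims=True)
        U = U / np.maximum(norms, 1e-12)

    # Step 3: k-means on the rows gives one class label per individual.
    labels = kmeans(U, K)
    Z_hat = np.eye(K)[labels]  # N x K one-hot classification matrix

    # Step 4 (ASSUMED estimator): class-wise mean response for each item.
    counts = np.maximum(Z_hat.sum(axis=0), 1.0)
    Theta_hat = (R.T @ Z_hat) / counts  # J x K
    return Z_hat, Theta_hat
```

Calling `lca_rsc(R, K)` on an \(N \times J\) response matrix returns a one-hot \(N \times K\) matrix and a \(J \times K\) item parameter estimate; passing `normalize_rows=True` clusters the row-normalized singular vectors instead. The k-means routine is written out in NumPy only to keep the sketch self-contained; any standard k-means implementation could be substituted.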