A Comparison of Bayesian Inference Techniques for Sparse Factor Analysis

Yong See Foo, Heejung Shim
DOI: https://doi.org/10.48550/arXiv.2112.11719
2021-12-22
Abstract: Dimension reduction algorithms aim to discover latent variables which describe underlying structures in high-dimensional data. Methods such as factor analysis and principal component analysis have the downside of not offering much interpretability of their inferred latent variables. Sparse factor analysis addresses this issue by imposing sparsity on its factor loadings, allowing each latent variable to be related to only a subset of features, thus increasing interpretability. Sparse factor analysis has been used in a wide range of areas including genomics, signal processing, and economics. We compare two Bayesian inference techniques for sparse factor analysis, namely Markov chain Monte Carlo (MCMC) and variational inference (VI). VI is computationally faster than MCMC, at the cost of a loss in accuracy. We derive MCMC and VI algorithms and perform a comparison using both simulated and biological data, demonstrating that the higher computational efficiency of VI is desirable over the small gain in accuracy when using MCMC. Our implementation of MCMC and VI algorithms for sparse factor analysis is available at https://github.com/ysfoo/sparsefactor.
Applications, Quantitative Methods
What problem does this paper attempt to address?
This paper compares two Bayesian inference techniques for sparse factor analysis, Markov chain Monte Carlo (MCMC) and variational inference (VI), with a focus on the trade-off between accuracy and computational efficiency. Using simulated and biological data, it evaluates both methods under a spike-and-slab prior and asks which one better balances computational speed against accuracy in practice.

### Background of the paper

- **Sparse factor analysis**: A dimension reduction technique for high-dimensional data that imposes sparsity on the factor loadings, so that each latent variable is related to only a subset of the observed variables. This makes the model easier to interpret.
- **Bayesian inference**: Models the sparsity of the factor loadings through prior distributions. A common choice is the spike-and-slab prior.
- **MCMC**: Approximates the posterior distribution with samples drawn from a Markov chain whose stationary distribution is the posterior. It is usually computationally expensive but accurate.
- **VI**: Finds an approximating distribution by solving an optimization problem. It is fast but may sacrifice some accuracy.

### Research objectives

- **Compare MCMC and VI**: Evaluate the performance of the two methods in sparse factor analysis, especially the trade-off between accuracy and computational efficiency.
- **Experimental design**: Conduct experiments on simulated data and biological data to evaluate the performance of the two methods.
- **Conclusion**: Although MCMC is slightly more accurate, VI is computationally more efficient, and with multiple runs VI can reach an accuracy similar to that of MCMC.

### Main contributions

- **Algorithm implementation**: Derives MCMC and VI algorithms for sparse factor analysis and publishes the implementation on GitHub.
- **Performance evaluation**: Evaluates both methods on simulated and biological data in terms of accuracy and computation time.
- **Practical application**: Demonstrates the methods on gene expression data, in particular their potential for inferring gene regulatory networks.

### Formula summary

- **Data generation model**:
  \[ Y = LF + E \]
  where \(Y\) is the observed data matrix, \(L\) is the factor loading matrix, \(F\) is the factor activation matrix, and \(E\) is the error matrix.
- **Likelihood function**:
  \[ p(y_{\cdot j} | L, F, \tau) = \mathcal{N}(y_{\cdot j} | L f_{\cdot j}, \text{diag}(\{\tau_i^{-1}\}_{i = 1}^G)) \]
  where \(y_{\cdot j}\) and \(f_{\cdot j}\) are the \(j\)-th columns of \(Y\) and \(F\), and \(\tau_i\) is the noise precision of the \(i\)-th of the \(G\) rows of \(Y\).
- **Spike and slab prior**:
  \[ p(l_{ik} | z_{ik}, \alpha_k) = \begin{cases} \delta_0(l_{ik}) & \text{if } z_{ik} = 0 \\ \mathcal{N}(l_{ik} | 0, \alpha_k^{-1}) & \text{if } z_{ik} = 1 \end{cases} \]
- **Bernoulli prior of the connection matrix**:
  \[ p(z_{ik}) = \text{Bernoulli}(z_{ik} | \pi_k) \]
- **Gamma prior**:
  \[ p(\alpha_k) = \Gamma(\alpha_k | a_\alpha, b_\alpha) \]
  \[ p(\tau_i) = \Gamma(\tau_i | a_\tau, b_\tau) \]

### Experimental results

- **Simulated data**: VI is substantially faster than MCMC, with little difference in accuracy. On less noisy data sets, VI performs nearly as well as MCMC.
- **Biological data**
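
To make the model in the formula summary concrete, the following minimal sketch draws one synthetic data set from it. It is not the authors' implementation. The function name `simulate_sparse_factor`, the dimensions, and the hyperparameter values are hypothetical. The summary does not say how \(F\) is distributed or how the Gamma priors are parametrised, so the sketch assumes standard normal factors, a shape/rate Gamma parametrisation, and a single inclusion probability shared by all factors.

```python
import numpy as np


def simulate_sparse_factor(G=50, N=100, K=3, pi=0.2,
                           a_alpha=2.0, b_alpha=2.0,
                           a_tau=2.0, b_tau=2.0, seed=0):
    """Draw (Y, L, F, Z, tau) from the model Y = L F + E (illustrative only)."""
    rng = np.random.default_rng(seed)

    # z_ik ~ Bernoulli(pi_k); one pi shared across factors (assumption).
    Z = rng.random((G, K)) < pi

    # alpha_k ~ Gamma(a_alpha, b_alpha), rate parametrisation assumed,
    # so NumPy's scale argument is 1 / rate.
    alpha = rng.gamma(shape=a_alpha, scale=1.0 / b_alpha, size=K)

    # Spike and slab: l_ik = 0 if z_ik = 0, else l_ik ~ N(0, 1 / alpha_k).
    slab = rng.normal(0.0, 1.0 / np.sqrt(alpha), size=(G, K))
    L = np.where(Z, slab, 0.0)

    # Factor activations; the standard normal prior is an assumption.
    F = rng.normal(size=(K, N))

    # tau_i ~ Gamma(a_tau, b_tau): per-row noise precision.
    tau = rng.gamma(shape=a_tau, scale=1.0 / b_tau, size=G)

    # e_ij ~ N(0, 1 / tau_i), matching the diagonal covariance in the likelihood.
    E = rng.normal(size=(G, N)) / np.sqrt(tau)[:, None]

    Y = L @ F + E
    return Y, L, F, Z, tau


Y, L, F, Z, tau = simulate_sparse_factor()
print(Y.shape, "fraction of nonzero loadings:", Z.mean())
```

Data simulated this way has a known connection matrix `Z`, which gives a ground truth for checking how well an inference method recovers the sparsity pattern.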