Adaptive gPCA: A method for structured dimensionality reduction

Julia Fukuyama
DOI: https://doi.org/10.48550/arXiv.1702.00501
2017-02-02
Abstract: When working with large biological data sets, exploratory analysis is an important first step for understanding the latent structure and for generating hypotheses to be tested in subsequent analyses. However, when the number of variables is large compared to the number of samples, standard methods such as principal components analysis give results which are unstable and difficult to interpret. To mitigate these problems, we have developed a method which allows the analyst to incorporate side information about the relationships between the variables in a way that encourages similar variables to have similar loadings on the principal axes. This leads to a low-dimensional representation of the samples which both describes the latent structure and which has axes which are interpretable in terms of groups of closely related variables. The method is derived by putting a prior encoding the relationships between the variables on the data and following through the analysis on the posterior distributions of the samples. We show that our method does well at reconstructing true latent structure in simulated data and we also demonstrate the method on a dataset investigating the effects of antibiotics on the composition of bacteria in the human gut.
Methodology, Applications
What problem does this paper attempt to address?
This paper addresses a problem that arises in the exploratory analysis of large biological data sets. When the number of variables is large compared to the number of samples, standard principal components analysis (PCA) gives results that are unstable and difficult to interpret. To mitigate this, the author developed adaptive generalized principal components analysis (adaptive gPCA), which lets the analyst incorporate side information about the relationships between the variables in a way that encourages similar variables to have similar loadings on the principal axes. The result is a low-dimensional representation of the samples that describes the latent structure and whose axes can be interpreted in terms of groups of closely related variables. The method is derived by placing a prior that encodes the relationships between the variables on the data and then carrying the analysis through on the posterior distributions of the samples. It reconstructs the true latent structure well in simulated data and is also demonstrated on a data set investigating the effects of antibiotics on the composition of bacteria in the human gut. Adaptive gPCA thus offers a flexible analysis that can be tuned to be coarser or finer and whose axes are easier to interpret, which helps in generating hypotheses and understanding the biology underlying the data.
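The idea of putting a prior on the data and analyzing the posterior can be illustrated with a minimal sketch. This is not the author's implementation or the exact adaptive gPCA estimator. It assumes a simple Gaussian model in which each sample is a latent mean plus independent noise, the latent mean has prior covariance proportional to a variable-similarity matrix `Q`, and the balance between prior and noise is a hand-chosen parameter `r`. The function name `posterior_mean_pca`, the parameter `r`, and the toy similarity matrix are all hypothetical and introduced only for illustration.

```python
import numpy as np


def posterior_mean_pca(X, Q, r, k=2):
    """Illustrative sketch only (not the paper's estimator).

    Assumed model for each centered sample x_i (a row of X):
        x_i = mu_i + noise,  mu_i ~ N(0, r * Q),  noise ~ N(0, (1 - r) * I)
    Q is a symmetric positive definite similarity matrix between variables,
    and 0 < r < 1 is fixed by hand here.
    """
    p = X.shape[1]
    Xc = X - X.mean(axis=0)  # center each variable
    A = r * Q + (1.0 - r) * np.eye(p)  # marginal covariance of x_i
    # Posterior mean of mu_i given x_i is r * Q @ inv(A) @ x_i.
    # solve(A, r * Q) gives inv(A) @ (r * Q); its transpose is r * Q @ inv(A).
    S = np.linalg.solve(A, r * Q).T
    M = Xc @ S.T  # row i holds the posterior mean for sample i
    # Ordinary PCA (via SVD) on the smoothed data.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    scores = U[:, :k] * s[:k]  # sample coordinates
    axes = Vt[:k].T  # principal axes (loadings)
    return scores, axes


# Toy usage: 8 variables arranged along a line, nearby variables are similar.
rng = np.random.default_rng(0)
idx = np.arange(8)
Q_toy = np.exp(-np.abs(idx[:, None] - idx[None, :]) / 2.0)
X_toy = rng.normal(size=(5, 8))
scores_toy, axes_toy = posterior_mean_pca(X_toy, Q_toy, r=0.5, k=2)
print(scores_toy.shape, axes_toy.shape)
```

In this sketch, values of `r` close to 1 put more weight on the similarity structure in `Q`, so related variables are smoothed toward each other before the axes are computed, while values close to 0 shrink the data strongly toward zero. The paper's own way of choosing the strength of the prior is not reproduced here.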