Accelerating Bayesian inference of dependency between complex biological traits

Zhenyu Zhang, Akihiko Nishimura, Nídia S. Trovão, Joshua L. Cherry, Andrew J. Holbrook, Xiang Ji, Philippe Lemey, Marc A. Suchard
DOI: https://doi.org/10.48550/arXiv.2201.07291
2022-09-08
Abstract: Inferring dependencies between complex biological traits while accounting for evolutionary relationships between specimens is of great scientific interest yet remains infeasible when trait and specimen counts grow large. The state-of-the-art approach uses a phylogenetic multivariate probit model to accommodate binary and continuous traits via a latent variable framework, and utilizes an efficient bouncy particle sampler (BPS) to tackle the computational bottleneck -- integrating many latent variables from a high-dimensional truncated normal distribution. This approach breaks down as the number of specimens grows and fails to reliably characterize conditional dependencies between traits. Here, we propose an inference pipeline for phylogenetic probit models that greatly outperforms BPS. The novelty lies in 1) a combination of the recent Zigzag Hamiltonian Monte Carlo (Zigzag-HMC) with linear-time gradient evaluations and 2) a joint sampling scheme for highly correlated latent variables and correlation matrix elements. In an application exploring HIV-1 evolution from 535 viruses, the inference requires joint sampling from an 11,235-dimensional truncated normal and a 24-dimensional covariance matrix. Our method yields a 5-fold speedup compared to BPS and makes it possible to learn partial correlations between candidate viral mutations and virulence. Computational speedup now enables us to tackle even larger problems: we study the evolution of influenza H1N1 glycosylations on around 900 viruses. For broader applicability, we extend the phylogenetic probit model to incorporate categorical traits, and demonstrate its use to study Aquilegia flower and pollinator co-evolution.
Methodology, Populations and Evolution, Computation
What problem does this paper attempt to address?
The paper addresses the computational challenge of inferring dependencies between complex biological traits while accounting for the evolutionary relationships between specimens. As the numbers of traits and specimens grow, existing methods become infeasible, especially when the traits mix continuous and discrete types. The paper proposes a new inference pipeline for phylogenetic probit models that targets this efficiency problem.

### Problems the paper attempts to solve

1. **Computational bottleneck**: Existing methods are too computationally expensive for large numbers of specimens and traits, which makes efficient Bayesian inference difficult. The main cost is integrating many latent variables from a high-dimensional truncated normal distribution.
2. **Conditional dependence inference**: The existing BPS-based approach cannot reliably characterize conditional dependencies between traits. This limits what can be learned about potential causal pathways.
3. **Extension to categorical traits**: Existing methods handle continuous and binary traits but not categorical traits. The paper extends the model to cover categorical traits, so it applies to a wider range of studies.

### Solutions

1. **Zigzag Hamiltonian Monte Carlo (Zigzag-HMC)**: Combined with linear-time gradient evaluations, Zigzag-HMC samples latent variables efficiently from a high-dimensional truncated normal distribution.
2. **Joint sampling scheme**: Highly correlated latent variables and correlation matrix elements are sampled jointly, which improves inference efficiency and gives more reliable estimates of the conditional correlations between traits.
3. **Extension to categorical traits**: The phylogenetic probit model is extended to include categorical traits, making it suitable for more types of biological data.

### Application examples

1. **HIV evolution study**: The method is applied to 535 HIV-1 viruses to infer conditional dependencies between immune escape mutations and virulence. Inference requires joint sampling from an 11,235-dimensional truncated normal and is about 5 times faster than with BPS. The results relate certain mutations to viral replication capacity and CD4 cell count, which helps in understanding the immune escape mechanism of HIV.
2. **Influenza H1N1 glycosylation patterns**: Glycosylation patterns of influenza H1N1 are studied across different hosts on around 900 viruses. Strong conditional dependencies are detected, which sheds light on mechanisms related to host switching.
3. **Aquilegia flower and pollinator co-evolution**: The study examines how the characteristics of Aquilegia flowers relate to different pollinators. Extending the model to categorical traits makes this analysis possible.

### Summary

The proposed inference pipeline removes the computational bottleneck that arises with large numbers of specimens and traits, and it infers conditional dependencies between traits more reliably. The extension to categorical traits also widens the range of biological data the model can handle.
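
The conditional dependence between traits discussed above is usually summarized by partial correlations, which can be read off the inverse of a correlation (or covariance) matrix. The sketch below shows only that deterministic conversion. It is not the authors' pipeline, which estimates the correlation matrix by Bayesian sampling. The function name `partial_correlations` and the three-trait example matrix are assumptions made for illustration.

```python
import numpy as np


def partial_correlations(cov):
    """Convert a covariance or correlation matrix into partial correlations.

    Entry (i, j) is the correlation between traits i and j after
    conditioning on all remaining traits.
    """
    precision = np.linalg.inv(cov)            # precision (inverse covariance) matrix
    scale = np.sqrt(np.diag(precision))       # square roots of the diagonal entries
    pcor = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcor, 1.0)               # a trait is perfectly correlated with itself
    return pcor


# Hypothetical example: three traits in a chain a -> b -> c.
# corr(a, b) = corr(b, c) = 0.8, so corr(a, c) = 0.8 * 0.8 = 0.64.
corr = np.array([
    [1.00, 0.80, 0.64],
    [0.80, 1.00, 0.80],
    [0.64, 0.80, 1.00],
])

pcor = partial_correlations(corr)
print(np.round(pcor, 3))
```

In this toy matrix, traits a and c are marginally correlated (0.64), but their partial correlation given b is zero up to floating-point error, because all of their association passes through b. Distinguishing these two cases is what makes partial correlations useful for separating direct from indirect dependencies.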