Abstract: Inferring dependencies between complex biological traits while accounting for evolutionary relationships between specimens is of great scientific interest yet remains infeasible when trait and specimen counts grow large. The state-of-the-art approach uses a phylogenetic multivariate probit model to accommodate binary and continuous traits via a latent variable framework, and utilizes an efficient bouncy particle sampler (BPS) to tackle the computational bottleneck -- integrating many latent variables from a high-dimensional truncated normal distribution. This approach breaks down as the number of specimens grows and fails to reliably characterize conditional dependencies between traits. Here, we propose an inference pipeline for phylogenetic probit models that greatly outperforms BPS. The novelty lies in 1) a combination of the recent Zigzag Hamiltonian Monte Carlo (Zigzag-HMC) with linear-time gradient evaluations and 2) a joint sampling scheme for highly correlated latent variables and correlation matrix elements. In an application exploring HIV-1 evolution from 535 viruses, the inference requires joint sampling from an 11,235-dimensional truncated normal and a 24-dimensional covariance matrix. Our method yields a 5-fold speedup compared to BPS and makes it possible to learn partial correlations between candidate viral mutations and virulence. Computational speedup now enables us to tackle even larger problems: we study the evolution of influenza H1N1 glycosylations on around 900 viruses. For broader applicability, we extend the phylogenetic probit model to incorporate categorical traits, and demonstrate its use to study Aquilegia flower and pollinator co-evolution.

What problem does this paper attempt to address?

The paper addresses the computational challenges of inferring dependencies between complex biological traits while accounting for evolutionary relationships between specimens. As the number of traits and specimens grows, existing methods become infeasible, especially when the traits mix continuous and discrete types. The paper proposes a new inference pipeline aimed at this efficiency problem.

### Problems the paper attempts to solve

1. **Computational bottleneck**: Existing methods are prohibitively expensive for large numbers of specimens and traits, which makes efficient Bayesian inference difficult. The cost is dominated by integrating many latent variables from a high-dimensional truncated normal distribution.
2. **Conditional dependence inference**: Existing methods cannot reliably characterize the conditional dependencies between traits. This restricts the understanding of potential causal paths.
3. **Extension to categorical traits**: Existing methods mainly handle continuous and binary traits and lack effective support for categorical traits. The proposed method adds this support, so it applies to a wider range of research scenarios.

### Solutions

1. **Zigzag Hamiltonian Monte Carlo (Zigzag-HMC)**: Combined with linear-time gradient evaluations, Zigzag-HMC efficiently samples latent variables from a high-dimensional truncated normal distribution.
2. **Joint sampling scheme**: Jointly sampling the highly correlated latent variables and correlation matrix elements improves inference efficiency and gives more reliable estimates of the conditional dependencies between traits.
3. **Extension to categorical traits**: The phylogenetic probit model is extended to include categorical traits, making it suitable for more types of biological data.

### Application examples

1. **HIV-1 evolution study**: The new method is applied to 535 HIV-1 viruses to infer the conditional dependence between immune escape mutations and virulence. The results reveal relationships between certain mutations and viral replication ability and CD4 cell count, which helps to understand the immune escape mechanism of HIV.
2. **Influenza H1N1 glycosylation patterns**: Glycosylation patterns of influenza H1N1 are studied across different hosts. Strong conditional dependencies are detected, and mechanisms relevant to host switching are revealed.
3. **Aquilegia flower and pollinator co-evolution**: The study examines how the characteristics of Aquilegia flowers attract different pollinators. Extending the model to categorical traits allows a more comprehensive biological interpretation.

### Summary

The proposed inference pipeline removes the computational bottleneck that arises with large numbers of specimens and traits, and it infers the conditional dependencies between traits more reliably. These improvements increase computational efficiency and widen the range of biological data the model can handle.
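The conditional dependence between traits discussed above is commonly summarized by partial correlations, which can be read off the inverse of a trait covariance matrix: for precision matrix P, the partial correlation of traits i and j given all others is -P_ij / sqrt(P_ii P_jj). The sketch below illustrates only that standard identity in plain Python. It is not the paper's pipeline, and the function names and the 3x3 covariance matrix are hypothetical, chosen for illustration.

```python
import math


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def partial_correlations(cov):
    """Partial correlation of each trait pair, conditioning on all other traits."""
    prec = invert(cov)
    n = len(cov)
    return [[1.0 if i == j
             else -prec[i][j] / math.sqrt(prec[i][i] * prec[j][j])
             for j in range(n)]
            for i in range(n)]


# Hypothetical covariance: traits 0 and 1 are correlated (0.25),
# but only through their shared dependence on trait 2.
cov = [[1.0, 0.25, 0.5],
       [0.25, 1.0, 0.5],
       [0.5, 0.5, 1.0]]
pcor = partial_correlations(cov)
```

In this toy example the marginal correlation between traits 0 and 1 is nonzero, while their partial correlation given trait 2 vanishes. Separating direct from indirect association in this way is what the conditional dependence estimates are meant to do.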

Accelerating Bayesian inference of dependency between complex biological traits

Efficient Bayesian Inference of General Gaussian Models on Large Phylogenetic Trees

From Genome-Scale Data to Models of Infectious Disease: A Bayesian Network-Based Strategy to Drive Model Development

Scalable Bayesian divergence time estimation with ratio transformations

Bayesian Inference of Dependent Population Dynamics in Coalescent Models

Integrating Transmission Dynamics and Pathogen Evolution Through a Bayesian Approach

Bayesian Inference of Pathogen Phylogeography using the Structured Coalescent Model

Many-core algorithms for high-dimensional gradients on phylogenetic trees

Bayesian inference of relative fitness on high-throughput pooled competition assays

A Relaxed Drift Diffusion Model for Phylogenetic Trait Evolution

An Efficient Bayesian Inference Framework for Coalescent-Based Nonparametric Phylodynamics

Online Bayesian phylodynamic inference in BEAST with application to epidemic reconstruction

Fast and Accurate Maximum-Likelihood Estimation of Multi-Type Birth-Death Epidemiological Models from Phylogenetic Trees

Understanding Past Population Dynamics: Bayesian Coalescent-Based Modeling with Covariates

Bayesian Inference of Species Trees from Multilocus Data

Assessing phenotypic correlation through the multivariate phylogenetic latent liability model

Bayesian Inference of Evolutionary Histories under Time-Dependent Substitution Rates

Random-effects substitution models for phylogenetics via scalable gradient approximations

Infinite Mixture Models for Improved Modeling of Across-Site Evolutionary Variation

Bayesian phylodynamic inference of multi-type population trajectories using genomic data