Abstract:Models have always been central to inferring molecular evolution and to reconstructing phylogenetic trees. Their use typically involves the development of a mechanistic framework reflecting our understanding of the underlying biological processes, such as nucleotide substitu- tions, and the estimation of model parameters by maximum likelihood or Bayesian inference. However, deriving and optimizing the likelihood of the data is not always possible under complex evolutionary scenarios or even tractable for large datasets, often leading to unrealistic simplifying assumptions in the fitted models. To overcome this issue, we coupled stochastic simulations of genome evolution with a new supervised deep learning model to infer key parameters of molecular evolution. Our model is designed to directly analyze multiple sequence alignments and estimate per-site evolutionary rates and divergence, without requiring a known phylogenetic tree. The accuracy of our predictions matched that of likelihood-based phylogenetic inference, when rate heterogeneity followed a simple gamma distribution, but it strongly exceeded it under more complex patterns of rate variation, such as codon models. Our approach is highly scalable and can be efficiently applied to genomic data, as we showed on a dataset of 26 million nucleotides from the clownfish clade. Our simulations also showed that the integration of per-site rates obtained by deep learning within a Bayesian framework led to significantly more accu- rate phylogenetic inference, particularly with respect to the estimated branch lengths. We thus propose that future advancements in phylogenetic analysis will benefit from a semi-supervised learning approach that combines deep-learning estimation of substitution rates, which allows for more flexible models of rate variation, and probabilistic inference of the phylogenetic tree, which guarantees interpretability and a rigorous assessment of statistical support.
What problem does this paper attempt to address?
This paper aims to solve the problems of difficult optimization of traditional likelihood - based molecular evolution models and unreasonable simplifying assumptions in complex evolutionary scenarios. Specifically, the author proposes a new method that combines stochastic simulation and supervised deep - learning models to infer key parameters of molecular evolution, such as site - specific substitution rates and divergence, without the need for a known phylogenetic tree. This method performs well in dealing with complex rate - variation patterns, especially in complex patterns such as codon models, and its prediction accuracy exceeds that of traditional likelihood - based phylogenetic inference methods.
### Problems the paper attempts to solve:
1. **Model optimization challenges in complex evolutionary scenarios**:
- When dealing with complex evolutionary scenarios, traditional methods have difficulty optimizing the likelihood of data, which leads to simplifying assumptions of the model that may not be in line with the actual situation.
2. **Processing of large - scale data sets**:
- For large - scale data sets, traditional methods are computationally inefficient and difficult to complete the analysis within a reasonable time.
3. **Improving the accuracy of phylogenetic inference**:
- By combining deep learning and the Bayesian framework, improve the accuracy of phylogenetic tree inference, especially in the estimation of branch lengths.
### Solutions:
- **Combination of stochastic simulation and supervised deep - learning models**:
- Use stochastic simulation to generate genomic evolution data, and then use supervised deep - learning models (such as recurrent neural networks) to infer key parameters of molecular evolution.
- **Direct analysis of multiple - sequence alignments**:
- The model can directly analyze DNA sequence alignments of multiple species, estimate the evolution rate and divergence at each site, without the need for a known phylogenetic tree.
- **Semi - supervised learning method**:
- Combine the substitution rates estimated by deep learning with phylogenetic inference in the Bayesian framework to form a semi - supervised learning method to improve the accuracy and interpretability of inference.
### Main contributions:
- **Flexibility**:
- This method can handle more complex rate - variation patterns, such as codon models, and its prediction accuracy significantly exceeds that of traditional methods in these complex patterns.
- **Efficiency**:
- This method is highly scalable and can be efficiently applied to large - scale genomic data, such as the 26 - million - nucleotide data set shown in the paper.
- **Accuracy**:
- By integrating the deep - learning - estimated site - specific rates into the Bayesian framework, the accuracy of phylogenetic tree inference is significantly improved, especially in the estimation of branch lengths.
In conclusion, this paper proposes an innovative method that combines stochastic simulation, supervised deep - learning, and Bayesian inference, solves the problems of difficult optimization and processing of large - scale data sets of traditional methods in complex evolutionary scenarios, and at the same time improves the accuracy of phylogenetic inference.