PhyloTransformer: A Self-supervised Discriminative Model for Mutation Prediction Based on a Multi-head Self-attention Mechanism

Yingying Wu, Shusheng Xu, Shing-Tung Yau, Yi Wu
DOI: https://doi.org/10.21203/rs.3.rs-1578020/v1
2022-01-01
Abstract: Although coronaviruses have RNA proofreading functions, a large number of variants still exist as quasispecies. The coronaviruses identified so far might be just the tip of the iceberg, and potentially more fatal variants of concern (VOCs) may emerge over time. These VOCs may exhibit increased pathogenicity, infectivity, transmissibility, angiotensin-converting enzyme 2 (ACE2) binding affinity, and antigenicity, posing an increased threat to public health. In this article, we developed PhyloTransformer, a Transformer-based self-supervised discriminative model that can model genetic mutations that may lead to viral reproductive advantage. We trained PhyloTransformer on 1,765,297 severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) sequences to infer fitness advantages by directly modeling the amino acid sequence mutations. PhyloTransformer utilizes advanced techniques from natural language processing, including the Fast Attention Via positive Orthogonal Random features approach (FAVOR+) and the Masked Language Model (MLM), which enable efficient and accurate intra-sequence dependency modeling over the entire RNA sequence. We measured the prediction accuracy of novel mutations and novel combinations using our method and baseline models that take only local segments as input. We found that PhyloTransformer outperformed every baseline method with statistical significance. To identify mutations associated with altered glycosylation that might be favored during viral evolution, we predicted the occurrence of mutations in each nucleotide of the receptor binding motif (RBM) and predicted modifications of N-glycosylation sites. We anticipate that the viral mutations predicted by PhyloTransformer may identify potential mutations of threat to guide therapeutics and vaccine design for effective targeting of future SARS-CoV-2 variants.
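The masked language modeling objective named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the mask symbol, the 15% masking rate (a common convention in NLP), the helper names `mask_sequence` and `restore`, and the toy sequence are all assumptions made for illustration. The sketch only shows how training pairs are formed: some residues are hidden, and the model would be asked to predict the hidden residues from the rest of the sequence.

```python
import random

MASK = "#"  # hypothetical mask symbol, not taken from the paper


def mask_sequence(seq, mask_prob=0.15, rng=None):
    """Hide each residue independently with probability mask_prob.

    Returns (masked_seq, targets), where targets maps each masked
    position to the original residue a model would have to predict.
    """
    if rng is None:
        rng = random.Random()
    masked = []
    targets = {}
    for i, residue in enumerate(seq):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = residue
        else:
            masked.append(residue)
    return "".join(masked), targets


def restore(masked_seq, targets):
    """Put the original residues back, i.e. a perfect prediction."""
    chars = list(masked_seq)
    for i, residue in targets.items():
        chars[i] = residue
    return "".join(chars)


if __name__ == "__main__":
    toy = "ACDEFGHIKLMNPQRSTVWY"  # arbitrary toy string, not real viral data
    masked, targets = mask_sequence(toy, rng=random.Random(0))
    print(masked)
    print(targets)
```

In an actual MLM setup the prediction for each masked position would come from a Transformer encoder that attends over the whole sequence, and the training loss would be computed only at the masked positions.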