Abstract: Having infected millions of people, SARS-CoV-2 is evolving at an unprecedented rate, demanding an advanced, specialized analytic pipeline to capture its mutational spectra. To explore mutations and deletions in the spike (S) protein, the most-discussed protein of SARS-CoV-2, we comprehensively analyzed 35,750 complete S protein-coding sequences through a custom Python-based pipeline. The dataset, collected from GISAID up to 24 June 2020, covered six continents and five major climate zones. We identified 27,801 mutated strains (77.77% of sequences) relative to the reference Wuhan-Hu-1, of which 84.40% were mutated at only a single amino acid (aa). An outlier strain (EPI_ISL_463893) from Bosnia and Herzegovina possessed six aa substitutions. We also identified 11 residues with high aa mutation frequency, each showing four types of aa variation. The infamous D614G variant has spread worldwide with ever-rising dominance and across regions with different climatic conditions, alongside the L5F and D936Y mutants, which have been documented throughout all regions and all climate zones, respectively. We also found 988 unique aa substitutions spanning 660 residues, which differed significantly among continents (p = .003) and climate zones (p = .021) according to the Kruskal-Wallis test. In addition, we detected 17 in-frame deletions at four sites adjacent to the receptor-binding domain, which may have an impact on attenuation. This study provides a fast and accurate pipeline for identifying mutations and deletions in large datasets of coding and also non-coding sequences, as evidenced by the representative analysis of existing S protein data. By using separate multiple-sequence alignment, removing ambiguous sequences and in-frame stop codons, and utilizing pairwise alignment, the method can derive both synonymous and non-synonymous mutations, reported as (strain_ID reference aa:mutation position:strain aa).
We suggest that the pipeline will aid in the evolutionary surveillance of any SARS-CoV-2-encoded protein and will prove crucial in tracking the ever-increasing variation of many other divergent RNA viruses in the future. The code is available at https://github.com/SShaminur/Mutation-Analysis.
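
To make the reporting format concrete, the following is a minimal sketch of the final comparison step only. It is not the authors' code (the repository linked above is the authoritative implementation). It assumes the strain coding sequence has already been codon-aligned to an ungapped reference, so that an in-frame deletion appears as a `---` codon. The function name `compare_codons`, the label strings, and the toy sequences are hypothetical and chosen for illustration. Codons containing ambiguous bases are skipped, which is one possible reading of "removing ambiguous sequences".

```python
from itertools import product

# Standard genetic code, built in TCAG order (last base varies fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def compare_codons(strain_id, ref_cds, strain_cds):
    """Compare a codon-aligned strain CDS with an ungapped reference CDS.

    Returns (record, kind) tuples, where record follows the layout
    'strain_ID reference aa:position:strain aa' and kind is one of
    'synonymous', 'non-synonymous', or 'deletion' (hypothetical labels).
    """
    if len(ref_cds) != len(strain_cds) or len(ref_cds) % 3 != 0:
        raise ValueError("sequences must be codon-aligned and of equal length")

    records = []
    for i in range(0, len(ref_cds), 3):
        ref_codon = ref_cds[i:i + 3].upper()
        alt_codon = strain_cds[i:i + 3].upper()
        position = i // 3 + 1  # 1-based amino-acid position in the reference

        if ref_codon == alt_codon:
            continue
        ref_aa = CODON_TABLE.get(ref_codon)
        if ref_aa is None:
            continue  # reference codon is not a plain ACGT triplet

        if alt_codon == "---":
            records.append((f"{strain_id} {ref_aa}:{position}:-", "deletion"))
            continue

        alt_aa = CODON_TABLE.get(alt_codon)
        if alt_aa is None:
            continue  # ambiguous base (e.g. N) or partial gap: skip

        kind = "synonymous" if alt_aa == ref_aa else "non-synonymous"
        records.append((f"{strain_id} {ref_aa}:{position}:{alt_aa}", kind))
    return records


if __name__ == "__main__":
    # Toy sequences, not real spike data: reference encodes M-D-L-G.
    reference = "ATGGATCTTGGA"
    strain = "ATGGGTCTC---"
    for record, kind in compare_codons("demo_strain", reference, strain):
        print(record, kind)
```

Working at the codon level rather than on translated protein alone is what lets a sketch like this separate synonymous from non-synonymous changes; a real pipeline would additionally need the alignment and filtering steps described in the abstract.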

A k-mer Based Approach for SARS-CoV-2 Variant Identification

Effective and scalable clustering of SARS-CoV-2 sequences

Variation and Evolution Analysis of SARS-CoV-2 Using Self-Game Sequence Optimization

Detection of COVID-19 Using Protein Sequence Data via Machine Learning Classification Approach

A Machine Learning Approach to Identify Key Residues Involved in Protein–Protein Interactions Exemplified with SARS-CoV-2 Variants

Early computational detection of potential high-risk SARS-CoV-2 variants

Computational analysis of affinity dynamics between the variants of SARS-CoV-2 spike protein (RBD) and human ACE-2 receptor

Application of genomic signal processing as a tool for high-performance classification of SARS-CoV-2 variants: a machine learning-based approach

NGS data vectorization, clustering, and finding key codons in SARS-CoV-2 variations

Rapid classification of SARS-CoV-2 variant strains using machine learning-based label-free SERS strategy

Structural topological analysis of spike proteins of SARS-CoV-2 variants of concern highlight distinctive amino acid substitution patterns

Anomaly Detection Models for SARS-CoV-2 Surveillance Based on Genome K-Mers

Machine learning-based approach KEVOLVE efficiently identifies SARS-CoV-2 variant-specific genomic signatures

Multi-Stage Temporal Convolution Network for COVID-19 Variant Classification

Co-Mutations and Possible Variation Tendency of the Spike RBD and Membrane Protein in SARS-CoV-2 by Machine Learning

Variant-driven multi-wave pattern of COVID-19 via a Machine Learning analysis of spike protein mutations

Comprehensive annotations of the mutational spectra of SARS-CoV-2 spike protein: a fast and accurate pipeline

A Comparative Analysis of SARS-CoV-2 Variants of Concern (VOC) Spike Proteins Interacting with hACE2 Enzyme

New Virus Variant Detection Based on the Optimal Natural Metric

An unsupervised framework for comparing SARS-CoV-2 protein sequences using LLMs

Virus2Vec: Viral Sequence Classification Using Machine Learning