SARITA: A Large Language Model for Generating the S1 Subunit of the SARS-CoV-2 Spike Protein

Simone Rancati, Giovanna Nicora, Laura Bergomi, Tommaso Mario Buonocore, Daniel M Czyz, Enea Parimbelli, Riccardo Bellazzi, Marco Salemi, Mattia Prosperi, Simone Marini
DOI: https://doi.org/10.1101/2024.12.10.627777
2024-12-10
Abstract: The COVID-19 pandemic has profoundly impacted global health, economics, and daily life, with over 776 million cases and 7 million deaths from December 2019 to November 2024. Since the original SARS-CoV-2 Wuhan strain emerged, the virus has evolved into variants such as Alpha, Beta, Gamma, Delta, and Omicron, all characterized by mutations in the Spike glycoprotein, critical for viral entry into human cells via its S1 and S2 subunits. The S1 subunit, binding to the ACE2 receptor and mutating frequently, affects infectivity and immune evasion; the more conserved S2, on the other hand, facilitates membrane fusion. Predicting future mutations is crucial for developing vaccines and treatments adaptable to emerging strains, enhancing preparedness and intervention design. Generative Large Language Models (LLMs) are becoming increasingly common in the field of genomics, given their ability to generate realistic synthetic biological sequences, including applications in protein design and engineering. Here we present SARITA, an LLM with up to 1.2 billion parameters, based on GPT-3 architecture, designed to generate high-quality synthetic SARS-CoV-2 Spike S1 sequences. SARITA is trained via continuous learning on the pre-existing protein model RITA. When trained on Alpha, Beta, and Gamma variants (data up to February 2021 included), SARITA correctly predicts the evolution of future S1 mutations, including characterized mutations of Delta, Omicron and Iota variants. Furthermore, we show how SARITA outperforms alternative approaches, including other LLMs, in terms of sequence quality, realism, and similarity with real-world S1 sequences. These results indicate the potential of SARITA to predict future SARS-CoV-2 S1 evolution, potentially aiding in the development of adaptable vaccines and treatments.
Bioinformatics
What problem does this paper attempt to address?
The paper aims to predict future mutations in the S1 subunit of the SARS-CoV-2 Spike protein, so that vaccines and treatments can be adapted to emerging virus strains. To that end, the authors developed SARITA, a large language model based on the GPT-3 architecture with up to 1.2 billion parameters, designed to generate high-quality synthetic SARS-CoV-2 Spike S1 sequences. Trained through continuous learning on data from 2019 to 2021, the model can accurately predict mutations in the Spike S1 subunit during 2022-2023, especially the characteristic mutations of the Delta, Omicron and Iota variants. The study also shows that SARITA outperforms other methods, including other large language models (LLMs), in sequence quality, realism and similarity to actual S1 sequences.

### Main contributions of the paper

1. **Predicting future mutations**: SARITA can predict future mutations in the S1 subunit, which supports the development of vaccines and therapeutic drugs that respond better to emerging virus strains.
2. **High-quality sequence generation**: The synthetic S1 sequences generated by SARITA are biologically credible and realistic, and are highly similar to the original Wuhan strain and to other actual S1 sequences.
3. **Superior performance**: SARITA outperforms other existing methods on multiple evaluation metrics, including random generation methods and existing large language models (such as SpikeGPT2).

### Method overview

- **Dataset**: SARITA is trained on 612,759 high-quality SARS-CoV-2 Spike protein sequences downloaded from the GISAID database.
- **Model architecture**: SARITA is based on the GPT-3 architecture. It is a decoder-only Transformer that uses Rotary Positional Embeddings (RoPE) to better capture positional relationships in the input data.
- **Training strategy**: The pre-trained RITA model is fine-tuned through continuous learning to specialize it for the SARS-CoV-2 S1 subunit.
- **Evaluation metrics**:
  - **Sequence quality**: Check the generated sequences for invalid amino acids and measure their similarity to the original Wuhan strain.
  - **Sequence similarity**: Evaluate the similarity between the generated sequences and the actual S1 sequences in the test set using Levenshtein distance (LD), PAM30 score and False Mutation Rate (FMR).
  - **Single-point mutation prediction**: Evaluate whether the mutations in the generated sequences match the mutation positions in the actual sequences.

### Results

- **High-quality sequence generation**: More than 97% of the sequences generated by SARITA are of high quality and have the expected length.
- **High similarity**: The sequences generated by SARITA have a significantly higher PAM30 score than those of other methods, indicating that they are highly similar to the actual S1 sequences.
- **Low Levenshtein distance**: The sequences generated by SARITA have a low Levenshtein distance to the actual S1 sequences, indicating that the model can accurately predict future S1 mutations.

In conclusion, by generating high-quality synthetic S1 sequences, SARITA provides a powerful tool for predicting future mutations of SARS-CoV-2, thus contributing to the development of vaccines and therapeutic drugs.
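
To make the evaluation metrics listed in the method overview more concrete, here is a minimal Python sketch of three of them: the invalid-amino-acid check, Levenshtein distance, and single-point mutation matching. It is not the authors' implementation. The function names are hypothetical, the sequences are short toy strings rather than real S1 data, and the mutation comparison assumes equal-length sequences without insertions or deletions. PAM30 scoring and the False Mutation Rate are left out because the summary does not define how they are computed.

```python
# Illustrative sketch only; names and toy sequences are assumptions, not from the paper.

VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues


def is_valid_sequence(seq):
    """Return True if seq is non-empty and contains only standard amino acids."""
    return bool(seq) and set(seq) <= VALID_AMINO_ACIDS


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions), row-by-row DP."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def point_mutations(reference, seq):
    """Substitutions relative to reference, written like 'F5Y' (1-based position).

    Assumes both sequences have the same length (no indels); real data
    would need an alignment step first.
    """
    if len(reference) != len(seq):
        raise ValueError("sequences must have equal length")
    return {f"{r}{i}{s}"
            for i, (r, s) in enumerate(zip(reference, seq), start=1)
            if r != s}


# Toy example: a reference, a generated sequence, and an observed later sequence.
reference = "ACDEFGHIKL"
generated = "ACDEYGHIKL"
observed = "ACDEYGHIRL"

shared = point_mutations(reference, generated) & point_mutations(reference, observed)
print(is_valid_sequence(generated))      # True
print(levenshtein(generated, observed))  # 1
print(sorted(shared))                    # ['F5Y']
```

In this toy case the generated sequence reproduces one of the two substitutions seen in the observed sequence, which is the kind of match that the single-point mutation evaluation counts.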