A SARS-CoV-2 Interaction Dataset and VHH Sequence Corpus for Antibody Language Models

Hirofumi Tsuruta, Hiroyuki Yamazaki, Ryota Maeda, Ryotaro Tamura, Akihiro Imura
2024-10-16
Abstract: Antibodies are crucial proteins produced by the immune system to eliminate harmful foreign substances and have become pivotal therapeutic agents for treating human diseases. To accelerate the discovery of antibody therapeutics, there is growing interest in constructing language models using antibody sequences. However, the applicability of pre-trained language models for antibody discovery has not been thoroughly evaluated due to the scarcity of labeled datasets. To overcome these limitations, we introduce AVIDa-SARS-CoV-2, a dataset featuring the antigen-variable domain of heavy chain of heavy chain antibody (VHH) interactions obtained from two alpacas immunized with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) spike proteins. AVIDa-SARS-CoV-2 includes binary labels indicating the binding or non-binding of diverse VHH sequences to 12 SARS-CoV-2 mutants, such as the Delta and Omicron variants. Furthermore, we release VHHCorpus-2M, a pre-training dataset for antibody language models, containing over two million VHH sequences. We report benchmark results for predicting SARS-CoV-2-VHH binding using VHHBERT pre-trained on VHHCorpus-2M and existing general protein and antibody-specific pre-trained language models. These results confirm that AVIDa-SARS-CoV-2 provides valuable benchmarks for evaluating the representation capabilities of antibody language models for binding prediction, thereby facilitating the development of AI-driven antibody discovery. The datasets are available at https://datasets.cognanous.com.
Machine Learning, Genomics
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses a gap in antibody discovery: the applicability of pre-trained language models has not been thoroughly evaluated because labeled datasets are scarce. To close this gap, the authors introduce two datasets:

1. **AVIDa-SARS-CoV-2 Dataset**:
   - A labeled dataset of interactions between antigens and the variable domain of the heavy chain of heavy-chain antibodies (VHH), obtained from two alpacas immunized with the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) spike protein.
   - The dataset includes binary labels indicating whether each VHH sequence binds to each of 12 SARS-CoV-2 mutants (such as the Delta and Omicron variants).
   - Experimental validation indicates that these labels are reliable, which makes the dataset suitable for assessing model performance in antibody discovery tasks.
2. **VHHCorpus-2M Dataset**:
   - A pre-training dataset containing over 2 million unlabeled VHH sequences, used for training antibody language models.
   - To enhance the reliability and diversity of the sequences, the authors removed sequences that appeared only once and collected data from five different alpacas.

### Main Contributions

- **Release of AVIDa-SARS-CoV-2 and VHHCorpus-2M**: AVIDa-SARS-CoV-2 is a labeled SARS-CoV-2-VHH interaction dataset containing amino acid sequences, and VHHCorpus-2M includes over 2 million unlabeled VHH sequences. These datasets can be used to evaluate and pre-train antibody-specific language models, respectively.
- **Provision of SARS-CoV-2 Variant and VHH Interaction Information**: The AVIDa-SARS-CoV-2 dataset provides interaction information between various VHHs and 12 SARS-CoV-2 mutants, aiding the study of how antigen mutations affect antibody binding and how antigen-specific VHHs differ among individuals.
- **Release of VHHBERT**: A VHH-specific language model pre-trained on VHHCorpus-2M. VHHBERT is intended to serve as a baseline for subsequent VHH-specific language models.
- **Reporting Benchmark Results**: Results of predicting SARS-CoV-2-VHH interactions using VHHBERT and existing general protein and antibody-specific pre-trained language models. These results confirm the value of AVIDa-SARS-CoV-2 for evaluating the binding prediction capabilities of antibody language models.

### Related Work

- **Existing Pre-trained Antibody Language Models**: The paper summarizes existing pre-trained antibody language models and their datasets, including AntiBERTy, AntiBERTa, AbLang, EATLM, BERT-DS, etc. These models perform well in tasks such as recovering missing residues, predicting binding sites, and classifying B cells.
- **Pre-training Datasets**: Existing pre-training datasets are mainly collections of unlabeled antibody sequences, such as the OAS database. VHHCorpus-2M differs in that it consists entirely of full-length VHH sequences, which are the smallest functional units that bind to target antigens.
- **Evaluation Datasets**: Existing evaluation datasets assess model performance on specific tasks but are limited in scale. AVIDa-SARS-CoV-2 improves model evaluation accuracy by providing sequence-level binding and non-binding labels.

### Conclusion

By introducing the high-quality labeled dataset AVIDa-SARS-CoV-2 and the large-scale pre-training dataset VHHCorpus-2M, this paper fills a gap in the evaluation of pre-trained language models for antibody discovery. These datasets and models provide tools and benchmarks to accelerate antibody discovery and the development of AI-driven antibody therapies.
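
The singleton filtering described for VHHCorpus-2M (removing sequences that appeared only once) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `drop_singletons`, the `min_count` parameter, the choice to keep each retained sequence once, and the toy sequences are all hypothetical and do not come from the paper or its datasets.

```python
from collections import Counter


def drop_singletons(sequences, min_count=2):
    """Return distinct sequences observed at least `min_count` times.

    Hypothetical helper: a sequence seen only once is treated as less
    reliable and is dropped. Each retained sequence is emitted once,
    in first-seen order (an assumption made for this sketch).
    """
    counts = Counter(sequences)
    return [seq for seq, n in counts.items() if n >= min_count]


# Toy amino-acid strings, made up for illustration only.
reads = [
    "QVQLVESGG",
    "QVQLVESGG",
    "EVQLVESGG",  # appears once, so it is removed
    "QVQLQESGG",
    "QVQLQESGG",
    "QVQLQESGG",
]

corpus = drop_singletons(reads)
print(corpus)
```

Counting occurrences before filtering keeps the step to a single pass over the reads plus one pass over the distinct sequences. The threshold is left as a parameter because the summary states only that sequences appearing once were removed.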