Large scale paired antibody language models

Henry Kenlay, Frédéric A. Dreyer, Aleksandr Kovaltsuk, Dom Miketa, Douglas Pires, Charlotte M. Deane

2024-03-27

Abstract: Antibodies are proteins produced by the immune system that can identify and neutralise a wide variety of antigens with high specificity and affinity, and constitute the most successful class of biotherapeutics. With the advent of next-generation sequencing, billions of antibody sequences have been collected in recent years, though their application in the design of better therapeutics has been constrained by the sheer volume and complexity of the data. To address this challenge, we present IgBert and IgT5, the best performing antibody-specific language models developed to date which can consistently handle both paired and unpaired variable region sequences as input. These models are trained comprehensively using the more than two billion unpaired sequences and two million paired sequences of light and heavy chains present in the Observed Antibody Space dataset. We show that our models outperform existing antibody and protein language models on a diverse range of design and regression tasks relevant to antibody engineering. This advancement marks a significant leap forward in leveraging machine learning, large scale data sets and high-performance computing for enhancing antibody design for therapeutic development.

Biomolecules, Machine Learning

What problem does this paper attempt to address?

This paper addresses sequence modeling of antibodies, proteins produced by the immune system that recognize and neutralize a wide range of antigens. Next-generation sequencing has yielded billions of antibody sequences, but the size and complexity of these datasets have made them hard to exploit when designing better therapeutic antibodies. To address this, the authors present two antibody-specific language models, IgBert and IgT5, which accept both paired and unpaired variable region sequences as input. Both are trained on large-scale unlabeled antibody sequence data and are applied to design and regression tasks such as sequence recovery and affinity prediction, where they outperform existing antibody and protein language models.

The paper first introduces the basic structure and function of antibodies, emphasizing their role in antigen recognition and neutralization. It then describes how machine learning and large-scale datasets are used to improve antibody design, in particular by pre-training and fine-tuning language models so that they learn the complex grammar of antibody sequences. IgBert and IgT5 are based on the BERT and T5 architectures, respectively; after pre-training and fine-tuning on a large number of antibody sequences, they learn cross-chain features, which improves prediction performance. Experimental results show that the models perform well on sequence recovery, binding affinity prediction, and expression level prediction, indicating their potential for antibody engineering.

The paper concludes by discussing the advantages and limitations of these models and highlights the significant improvement in predictive ability gained from training on paired data. In summary, it addresses the problem of improving antibody design with large-scale data and machine learning by developing antibody-specific language models that represent antibody sequences more effectively for downstream engineering tasks.
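The summary above mentions paired and unpaired inputs, BERT-style pre-training, and sequence recovery without showing what these look like in practice. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `[SEP]`/`[MASK]` token strings, the space-separated residue format, and the 15% masking rate are illustrative choices borrowed from common BERT conventions, and the masking is simplified to always substitute the mask token.

```python
import random

MASK = "[MASK]"  # assumed mask token string (BERT convention)
SEP = "[SEP]"    # assumed chain separator token (BERT convention)


def format_input(heavy=None, light=None):
    """Space-separate residues; join chains with SEP when both are given.

    Hypothetical helper: handles paired input (heavy and light) and
    unpaired input (either chain alone).
    """
    chains = [" ".join(chain) for chain in (heavy, light) if chain]
    if not chains:
        raise ValueError("at least one chain is required")
    return f" {SEP} ".join(chains)


def mask_tokens(tokens, rate=0.15, seed=0):
    """Replace a random subset of residue tokens with MASK.

    Simplified masked-language-model corruption: the separator is never
    masked, and at least one residue is always masked.
    Returns the corrupted token list and the sorted masked positions.
    """
    rng = random.Random(seed)
    candidates = [i for i, tok in enumerate(tokens) if tok != SEP]
    if not candidates:
        raise ValueError("no residue tokens to mask")
    n_masked = max(1, round(rate * len(candidates)))
    positions = sorted(rng.sample(candidates, n_masked))
    masked = list(tokens)
    for i in positions:
        masked[i] = MASK
    return masked, positions


def sequence_recovery(true_tokens, predicted_tokens, positions):
    """Fraction of masked positions where the prediction equals the truth."""
    if not positions:
        raise ValueError("no masked positions to score")
    hits = sum(true_tokens[i] == predicted_tokens[i] for i in positions)
    return hits / len(positions)


if __name__ == "__main__":
    # Toy fragments, not real variable region sequences.
    tokens = format_input(heavy="EVQLVESGG", light="DIQMTQSPS").split()
    masked, positions = mask_tokens(tokens)
    print(" ".join(masked))
    # A real model would fill in the masked positions; here the ground
    # truth stands in as a perfect prediction to show the metric.
    print(sequence_recovery(tokens, tokens, positions))
```

In this sketch an unpaired sequence is simply the single-chain case of the same formatter, so one model input format covers both kinds of data.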

Addressing the antibody germline bias and its effect on language models for improved antibody design

Improving antibody language models with native pairing

Protein language models enable prediction of polyreactivity of monospecific, bispecific, and heavy-chain-only antibodies

Reprogramming Pretrained Language Models for Antibody Sequence Infilling

On Pre-trained Language Models for Antibody

Generative Antibody Design for Complementary Chain Pairing Sequences through Encoder-Decoder Language Model

Antibody Representation Learning for Drug Discovery

IgBlend: Unifying 3D Structures and Sequences in Antibody Language Models

Pre-training Antibody Language Models for Antigen-Specific Computational Antibody Design

Enhancing Antibody Language Models with Structural Information

Learning the Language of Antibody Hypervariability

Deciphering antibody affinity maturation with language models and weakly supervised learning

Novel antibody language model accelerates IgG screening and design for broad-spectrum antiviral therapy

S^2ALM: Sequence-Structure Pre-trained Large Language Model for Comprehensive Antibody Representation Learning

ImmunoLingo: Linguistics-based formalization of the antibody language

Accurate Prediction of Antibody Function and Structure Using Bio-Inspired Antibody Language Model

A generative foundation model for antibody sequence understanding