Baselining the Buzz Trastuzumab-HER2 Affinity, and Beyond

Lewis Chinery, Alissa M. Hummer, Brij Bhushan Mehta, Rahmad Akbar, Puneet Rawat, Andrei Slabodkin, Khang Le Quy, Fridtjof Lund-Johansen, Victor Greiff, Jeliazko R. Jeliazkov, Charlotte M. Deane
DOI: https://doi.org/10.1101/2024.03.26.586756
2024-03-29
Abstract: There is currently considerable interest in the field of antibody design, and deep learning techniques are now regularly applied to optimise antibody properties such as binding affinity. However, robust baselines within this field have not kept up with recent developments. In this study, we generate a dataset of over 524,000 Trastuzumab variants and use this to show that standard computational methods such as BLOSUM, AbLang, ESM, and ProteinMPNN can be used to design diverse antibody libraries from just a single starting sequence. These novel libraries are predicted to be enriched in binding variants and experimental validation of 700 of these designs is ongoing. We also demonstrate that, even with only a very small number of experimental data points, simple machine learning classifiers can be trained in seconds to accurately pre-screen future designs. This pre-screening maintains library diversity and saves experimental time and money.
Bioinformatics
What problem does this paper attempt to address?
The paper addresses antibody design and optimization, focusing on the binding affinity of Trastuzumab to HER2 (human epidermal growth factor receptor 2). The study aims to:

1. **Generate a large-scale antibody variant dataset**: The paper generates a dataset of over 524,000 Trastuzumab variants, which is used to train and evaluate binding affinity classifiers.
2. **Evaluate the performance of different machine learning methods on small datasets**: The study tests three classification methods, Fast Library for Automated Machine Learning (FLAML), a Convolutional Neural Network (CNN), and an Equivariant Graph Neural Network (EGNN), and evaluates their performance at different training set sizes. The CNN performs best on small datasets, achieving a PR AUC of 0.71 with only 170 sequences.
3. **Design a diverse antibody library**: The study uses several computational methods (BLOSUM, AbLang, ESM, and ProteinMPNN) to design a library of Trastuzumab CDRH3 variants, then predicts the binding of these variants with a trained CNN model. A significant proportion of the sequences in the resulting libraries are predicted to have a high binding probability.
4. **Improve experimental efficiency and reduce costs**: Pre-screening the antibody library computationally reduces the time and cost of experimental validation. The study also shows that machine learning models can be trained on a small amount of experimental data to guide subsequent design rounds.
5. **Explore the impact of different experimental data sources on model performance**: The study compares CNN models trained on different data sources and finds that they generalize poorly across experimental datasets, which may be related to differences in experimental setups and classification criteria.

Overall, the paper aims to improve the efficiency and effectiveness of antibody design and optimization by generating large-scale datasets, evaluating different machine learning methods, designing diverse antibody libraries, and optimizing experimental workflows.
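To make the PR AUC figure in point 2 and the pre-screening step in point 4 more concrete, here is a minimal sketch in plain Python. It is not the authors' code. It assumes that PR AUC is estimated by average precision (the paper may compute the area differently), and the function names `average_precision` and `prescreen`, the 0.5 threshold, and the placeholder sequence names are all hypothetical.

```python
from typing import List, Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision, a step-wise estimate of the area under the PR curve.

    labels: 1 for a binder, 0 for a non-binder.
    scores: predicted binding probability (higher means more likely to bind).
    Assumption: ties in score are broken by input order, which can differ
    from threshold-based implementations.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")

    # Rank variants from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    true_pos = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            # Precision at this rank, weighted by the recall step 1 / n_pos.
            ap += (true_pos / rank) / n_pos
    return ap


def prescreen(candidates: Sequence[str], scores: Sequence[float],
              threshold: float = 0.5) -> List[str]:
    """Keep the candidate designs whose predicted score meets the threshold."""
    return [seq for seq, score in zip(candidates, scores) if score >= threshold]


# Toy data: placeholder names, not real CDRH3 sequences or real predictions.
candidates = ["SEQ_A", "SEQ_B", "SEQ_C", "SEQ_D", "SEQ_E"]
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]

print(f"average precision: {average_precision(labels, scores):.3f}")
print("kept for experiment:", prescreen(candidates, scores))
```

When reading a PR AUC value, it helps to remember that a random classifier scores roughly the fraction of binders in the evaluation set, so 0.71 should be judged against that baseline, not against 0.5.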