Combining Bayesian optimization with sequence- or structure-based strategies for optimization of protein-peptide binding

Jérôme Eberhardt, Markus Lill, Torsten Schwede, Aidan Lees
DOI: https://doi.org/10.26434/chemrxiv-2023-b7l81-v2
2024-04-17
Abstract: This study introduces a novel Bayesian Optimization (BO) method to support the design and optimization of bioactive peptide sequences in the context of a fully automated closed-loop Design-Make-Test (DMT) pipeline. Using the major histocompatibility complex class I receptor system as a test case, we showed that BO is capable of efficiently navigating vast sequence spaces. Starting from a single peptide-lead sequence in the $\mu$M IC50 range, the method is able to optimize a peptide sequence to its optimal binding affinity in fewer than 5 DMT cycles, with 96 peptide sequences per batch. We extensively evaluated its performance, in various conditions and with different parameters, providing valuable insights for peptide optimization tasks in future closed-loop DMT environments. Different sequence- and structure-based initialization strategies were also tested, to generate the initial batch of peptide sequences, as well as different molecular fingerprints and protein language models. Additionally, the method developed here can natively handle various peptide sequence lengths and scaffolds (e.g. macrocycles) and support any arbitrary non-standard amino acids or residue modifications. The source code of our method, Mobius, is publicly available under the Apache license at https://git.scicore.unibas.ch/schwede/mobius.
Chemistry
What problem does this paper attempt to address?
The problem this paper addresses is how to efficiently explore and optimize the peptide sequence space when optimizing protein-peptide binding. Specifically, the authors introduce a new Bayesian Optimization (BO) method to support the design and optimization of bioactive peptide sequences in a fully automated closed-loop Design-Make-Test (DMT) pipeline. Using the major histocompatibility complex class I receptor system as a test case, they show that BO can efficiently navigate the vast sequence space: starting from a single peptide lead sequence in the micromolar IC50 range, the method optimizes the peptide sequence to its optimal binding affinity in fewer than 5 DMT cycles, with 96 peptide sequences per batch. In addition, the method can handle peptide sequences of different lengths and different scaffolds (such as macrocycles), and it supports arbitrary non-standard amino acids or residue modifications.

### Key Point Summary:

1. **Objective**: Develop an efficient Bayesian optimization method for the optimization of protein-peptide binding.
2. **Method**: Combine Bayesian optimization with sequence- or structure-based strategies to design and optimize peptide sequences in a fully automated closed-loop DMT pipeline.
3. **Test Case**: Use the major histocompatibility complex class I receptor system as a test case to verify the effectiveness of the method.
4. **Performance Evaluation**: The method was extensively evaluated under different conditions and with different parameters, providing insights for future peptide optimization tasks in closed-loop DMT environments.
5. **Innovation**: The method can handle peptide sequences of different lengths and different scaffolds, and it supports non-standard amino acids or residue modifications, which makes it flexible and broadly applicable.
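To make the closed-loop idea concrete, the following is a minimal toy sketch of a batched design-make-test loop: it starts from a single lead sequence, proposes point mutants, ranks them with a pluggable acquisition function, and "tests" one batch of 96 per cycle. Everything here is an assumption for illustration, not the Mobius API: the names `dmt_loop`, `mutate`, and `nn_acquisition` are hypothetical, the assay is a made-up function (Hamming distance to an arbitrary hidden target, lower is better, as with IC50), and the acquisition is a crude nearest-neighbour stand-in rather than a Gaussian-process-based one.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def mutate(seq, rng):
    """Return seq with exactly one random point substitution."""
    pos = rng.randrange(len(seq))
    new = rng.choice([a for a in AMINO_ACIDS if a != seq[pos]])
    return seq[:pos] + new + seq[pos + 1:]


def dmt_loop(lead, assay, acquisition, n_cycles=5, batch_size=96,
             n_proposals=500, seed=0):
    """Toy closed loop. Returns {sequence: measured value}; lower is better."""
    rng = random.Random(seed)
    tested = {lead: assay(lead)}  # start from the single lead sequence
    for _ in range(n_cycles):
        # Design: mutate the best sequences seen so far to build a candidate pool.
        parents = sorted(tested, key=tested.get)[:10]
        pool = {mutate(rng.choice(parents), rng) for _ in range(n_proposals)}
        pool.difference_update(tested)  # never re-test a sequence
        # Rank candidates with the pluggable acquisition (higher = more promising).
        candidates = sorted(pool)  # fixed order so that ties break reproducibly
        batch = sorted(candidates, key=lambda s: acquisition(s, tested),
                       reverse=True)[:batch_size]
        # Make + Test: collapsed here into one call to the hypothetical assay.
        for seq in batch:
            tested[seq] = assay(seq)
    return tested


# --- Toy usage (all values are made up) ---
target = "KLMNPQRST"  # arbitrary hidden optimum of the toy assay
lead = "AAAAAAAAA"    # arbitrary starting sequence


def assay(seq):
    """Hypothetical measurement: distance to the hidden target."""
    return hamming(seq, target)


def nn_acquisition(seq, tested):
    """Crude surrogate: score a candidate by its nearest tested neighbour."""
    nearest = min(tested, key=lambda t: hamming(seq, t))
    return -tested[nearest]


results = dmt_loop(lead, assay, nn_acquisition)
best = min(results, key=results.get)
print(best, results[best], len(results))
```

In a Bayesian optimization setting, the `acquisition` callable would instead be computed from a surrogate model's predictive mean and uncertainty, so that each batch balances exploiting good regions against exploring uncertain ones; the loop structure itself would stay the same.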
### Formula Explanation:

**Gaussian Process Regression (GPR) in Bayesian Optimization**

Gaussian process regression is a non-parametric regression method, used here to predict the binding of peptides to specific MHC alleles or other protein targets. A Gaussian process is defined by a mean function \( m(x) \) and a positive-definite covariance function \( k(x, x') \).

- The mean function is usually set to zero, i.e., \( m(x) = 0 \).
- The covariance function \( k(x, x') \) controls the shape of the function distribution; common choices include the radial basis function (RBF) kernel and the Tanimoto similarity (TS) kernel.

**RBF Kernel Function**:

\[ k_{\text{RBF}}(x, x') = \alpha \exp \left( -\frac{\|x - x'\|^{2}}{2l^{2}} \right) \]

where \( \alpha \) is the scaling factor, which sets the overall variance, and \( l \) is the length scale, which controls the smoothness.

**Tanimoto Similarity Kernel Function**:

\[ k_{\text{TS}}(x, x') = \alpha \frac{\sum_{j=1}^{n} x_j x'_j}{\sum_{j=1}^{n} x_j^{2} + \sum_{j=1}^{n} {x'_j}^{2} - \sum_{j=1}^{n} x_j x'_j} \]

where \( \alpha \) is the scaling factor and \( n \) is the size of the input vector.

Together, the Gaussian process surrogate and these kernels are what allow the method to optimize peptide sequences efficiently in a large sequence space.
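The two kernels and the zero-mean Gaussian process can be sketched in a few lines of NumPy. This is an illustrative implementation of the formulas as written, not the code used in Mobius: the function names `rbf_kernel`, `tanimoto_kernel`, and `gp_posterior`, the `noise` jitter term, the convention of assigning similarity 1 to two all-zero vectors, and the toy fingerprints and target values are all assumptions made for this example.

```python
import numpy as np


def rbf_kernel(X1, X2, alpha=1.0, length_scale=1.0):
    """k(x, x') = alpha * exp(-||x - x'||^2 / (2 l^2)) for every pair of rows."""
    sq_dists = (
        np.sum(X1**2, axis=1)[:, None]
        + np.sum(X2**2, axis=1)[None, :]
        - 2.0 * X1 @ X2.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return alpha * np.exp(-sq_dists / (2.0 * length_scale**2))


def tanimoto_kernel(X1, X2, alpha=1.0):
    """k(x, x') = alpha * <x, x'> / (||x||^2 + ||x'||^2 - <x, x'>)."""
    dot = X1 @ X2.T
    denom = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - dot
    # Assumption: two all-zero vectors (0/0) are treated as identical (similarity 1).
    safe = np.where(denom == 0.0, 1.0, denom)
    return alpha * np.where(denom == 0.0, 1.0, dot / safe)


def gp_posterior(X_train, y_train, X_test, kernel, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP, m(x) = 0, at X_test."""
    K = kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = kernel(X_train, X_test)
    L = np.linalg.cholesky(K)  # K = L L^T
    weights = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ weights
    v = np.linalg.solve(L, K_s)
    var = np.diag(kernel(X_test, X_test)) - np.sum(v**2, axis=0)
    return mean, np.maximum(var, 0.0)


# Toy binary fingerprints and made-up target values, for illustration only.
X = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 0, 1, 1]], dtype=float)
y = np.array([0.2, 0.5, 0.9])
X_new = np.array([[1, 1, 1, 0]], dtype=float)

mean, var = gp_posterior(X, y, X_new, tanimoto_kernel)
print(mean, var)
```

The predictive mean and variance returned by `gp_posterior` are the two quantities an acquisition function would combine to decide which candidate sequences to test next. The Tanimoto kernel is a natural fit for binary or count fingerprints, whereas the RBF kernel assumes that Euclidean distance between feature vectors is meaningful.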