An integrative approach to protein sequence design through multiobjective optimization

Lu Hong, Tanja Kortemme
DOI: https://doi.org/10.1101/2024.03.01.582670
2024-03-04
Abstract: With recent methodological advances in the field of computational protein design, in particular those based on deep learning, there is an increasing need for frameworks that allow for coherent, direct integration of different models and objective functions into the generative design process. Here we demonstrate how evolutionary multiobjective optimization techniques can be adapted to provide such an approach. With the established Non-dominated Sorting Genetic Algorithm II (NSGA-II) as the optimization framework, we use AlphaFold2 and ProteinMPNN confidence metrics to define the objective space, and a mutation operator composed of ESM-1v and ProteinMPNN to rank and then redesign the least favorable positions. Using the multistate design problem of the foldswitching protein RfaH as an in-depth case study, we show that the evolutionary multiobjective optimization approach leads to significant reduction in the bias and variance in RfaH native sequence recovery, compared to a direct application of ProteinMPNN. We suggest that this improvement is due to three factors: (i) the use of an informative mutation operator that accelerates the sequence space exploration, (ii) the parallel, iterative design process inherent to the genetic algorithm that improves upon the ProteinMPNN autoregressive sequence decoding scheme, and (iii) the explicit approximation of the Pareto front that leads to optimal design candidates representing diverse tradeoff conditions. We anticipate this approach to be readily adaptable to different models and broadly relevant for protein design tasks with complex specifications.
Biophysics
What problem does this paper attempt to address?
This paper addresses the problem of coherently integrating different models and objective functions into the generative protein sequence design process. The authors adapt evolutionary multiobjective optimization to this task. They use the Non-dominated Sorting Genetic Algorithm II (NSGA-II) as the optimization framework, define the objective space with AlphaFold2 and ProteinMPNN confidence metrics, and build a mutation operator from ESM-1v and ProteinMPNN that ranks positions and then redesigns the least favorable ones. In an in-depth case study of the multistate design problem for the fold-switching protein RfaH, the approach significantly reduces the bias and variance in RfaH native sequence recovery compared to a direct application of ProteinMPNN. The authors suggest that this improvement is due to three factors:
1. An informative mutation operator that accelerates sequence space exploration.
2. The parallel, iterative design process inherent to the genetic algorithm, which improves upon the ProteinMPNN autoregressive sequence decoding scheme.
3. The explicit approximation of the Pareto front, which yields optimal design candidates representing diverse tradeoff conditions.
The authors anticipate that the approach will be readily adaptable to different models and broadly relevant for protein design tasks with complex specifications.
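Two of the ideas summarized above, approximating the Pareto front and redesigning the least favorable positions, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the choice to minimize every objective, and the uniform random resampling are assumptions made for illustration: in the paper the objectives come from AlphaFold2 and ProteinMPNN confidence metrics, positions are ranked with ESM-1v, and redesign is done with ProteinMPNN. Full NSGA-II also ranks successive fronts and applies crowding-distance selection, which this sketch omits.

```python
import random
from typing import List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if `a` Pareto-dominates `b`, assuming all objectives are minimized."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(scores: Sequence[Sequence[float]]) -> List[int]:
    """Return indices of candidates that no other candidate dominates."""
    return [
        i
        for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


def mutate_worst_positions(
    seq: str,
    position_scores: Sequence[float],
    n_positions: int,
    rng: random.Random,
) -> str:
    """Resample the `n_positions` lowest-scoring positions.

    Hypothetical stand-in: `position_scores` plays the role of a per-position
    ranking model, and uniform resampling plays the role of a model-guided
    redesign step.
    """
    worst = sorted(range(len(seq)), key=lambda i: position_scores[i])[:n_positions]
    residues = list(seq)
    for i in worst:
        residues[i] = rng.choice(AMINO_ACIDS)
    return "".join(residues)


# Four candidate designs scored on two objectives (lower is better).
objective_scores = [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0), (3.0, 3.0)]
front = pareto_front(objective_scores)
print(front)  # [0, 1, 2]; candidate 3 is dominated by candidate 1

# Redesign the two least favorable positions of a toy sequence.
child = mutate_worst_positions(
    "ACDEFG", [0.9, 0.1, 0.8, 0.2, 0.7, 0.6], 2, random.Random(0)
)
```

Each candidate on the returned front represents a different tradeoff between the objectives, so no single "best" sequence is chosen. In the toy mutation call, only positions 1 and 3 (the two lowest scores) are eligible to change.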