Abstract: Generative machine learning models are increasingly being used to design novel proteins for therapeutic and biotechnological applications. However, the current methods mostly focus on the design of proteins with a fixed backbone structure, which leads to their limited ability to account for protein flexibility, one of the crucial properties for protein function. Learning to engineer protein flexibility is problematic because the available data are scarce, heterogeneous, and costly to obtain using computational as well as experimental methods. Our contributions to address this problem are three-fold. First, we comprehensively compare methods for quantifying protein flexibility and identify data relevant to learning. Second, we design and train flexibility predictors utilizing sequential or both sequential and structural information on the input. We overcome the data scarcity issue by leveraging a pre-trained protein language model. Third, we introduce a method for fine-tuning a protein inverse folding model to steer it toward desired flexibility in specified regions. We demonstrate that our method Flexpert-Design enables guidance of inverse folding models toward increased flexibility. This opens up new possibilities for protein flexibility engineering and the development of proteins with enhanced biological activities.

What problem does this paper attempt to address?

The main problem this paper addresses is the neglect of protein flexibility in current protein design methods. Most existing methods design sequences for a fixed backbone structure, which limits their ability to account for flexibility, even though flexibility is one of the key attributes that determine protein function. The paper therefore aims to develop tools that integrate protein flexibility into computational protein design.

### Specific description of the problem

1. **Importance of protein flexibility**:
   - Proteins are highly dynamic biomolecules, and their flexibility is crucial for biological function. In enzymes in particular, adjusting the conformational dynamics of loop regions near the active site can significantly affect substrate specificity, turnover rate, and pH dependence.
   - Many proteins require small molecules to be transported to the active site through tunnels in their structures, and the dynamic properties of these tunnels are crucial for protein function.
2. **Limitations of existing methods**:
   - Current design methods mainly target fixed backbone structures and cannot fully account for protein flexibility.
   - Experimental methods (such as X-ray crystallography, nuclear magnetic resonance, and hydrogen-deuterium exchange coupled with mass spectrometry) are accurate but costly, time-consuming, and low-throughput.
   - Computational methods (such as coarse-grained modeling and molecular dynamics simulations) offer a wide range of options, but they have not been systematically compared on large-scale data sets, and they are difficult to integrate with the latest generative models.
3. **Data scarcity**:
   - The available protein flexibility data are scarce, heterogeneous, and costly to obtain, whether through computational or experimental methods.

### Solution

To address these problems, the authors make three main contributions:

1. **Comprehensive comparison of methods for quantifying protein flexibility**:
   - Systematically evaluate different methods (such as molecular dynamics simulations, B-factors, AlphaFold2, ESMFold, GNM, and ANM) for quantifying protein flexibility, and identify the data that can be used for learning.
2. **Design and training of flexibility predictors**:
   - Develop two flexibility predictors: Flexpert-Seq (sequence only) and Flexpert-3D (sequence and structural information combined). The data scarcity problem is mitigated by leveraging a pre-trained protein language model (such as ProtTrans).
3. **Introduction of the Flexpert-Design framework**:
   - Propose a method to fine-tune protein inverse folding models (such as ProteinMPNN) so that they generate protein sequences with the desired flexibility according to specified flexibility instructions.

### Conclusion

With these methods, the authors demonstrate that flexibility can be predicted from sequence and show that prediction accuracy improves when structural information is incorporated. They also show how to guide inverse folding models toward protein sequences with increased flexibility, opening up new possibilities for protein flexibility engineering.
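Of the flexibility-quantification methods named in the first contribution, the Gaussian Network Model (GNM) is compact enough to illustrate in a few lines. The sketch below is a generic textbook-style GNM, not the authors' implementation: the function name `gnm_fluctuations`, the 7.0 Å contact cutoff, and the eigenvalue tolerance are all assumptions chosen for illustration. It takes C-alpha coordinates, builds the Kirchhoff (contact) matrix, and returns relative per-residue mean-square fluctuations from the diagonal of its pseudo-inverse.

```python
import numpy as np


def gnm_fluctuations(ca_coords, cutoff=7.0, tol=1e-8):
    """Relative per-residue fluctuations from a Gaussian Network Model.

    ca_coords: (N, 3) array of C-alpha coordinates in angstroms.
    cutoff:    contact distance in angstroms (assumed value, commonly ~7).
    tol:       eigenvalues at or below this are treated as zero modes.
    """
    coords = np.asarray(ca_coords, dtype=float)

    # Pairwise C-alpha distances.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    # Residues within the cutoff are connected by a spring; no self-contacts.
    contact = dist <= cutoff
    np.fill_diagonal(contact, False)

    # Kirchhoff matrix: -1 for each contact, node degree on the diagonal.
    kirchhoff = -contact.astype(float)
    np.fill_diagonal(kirchhoff, contact.sum(axis=1))

    # Pseudo-inverse diagonal via eigendecomposition, skipping zero modes
    # (rigid-body motion, or disconnected pieces of the contact graph).
    eigvals, eigvecs = np.linalg.eigh(kirchhoff)
    keep = eigvals > tol
    return (eigvecs[:, keep] ** 2 / eigvals[keep]).sum(axis=1)


if __name__ == "__main__":
    # Toy input: five pseudo-residues on a straight line, 3.8 angstroms apart.
    toy = np.array([[3.8 * i, 0.0, 0.0] for i in range(5)])
    print(gnm_fluctuations(toy, cutoff=4.0))
```

The returned values are in arbitrary units (the spring constant and temperature prefactor are omitted), so they are suitable for comparing residues within one structure or for correlating against another flexibility profile, not as absolute amplitudes.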

Learning to engineer protein flexibility