Abstract: Recent breakthroughs in AI coupled with the rapid accumulation of protein sequence and structure data have radically transformed computational protein design. New methods promise to escape the constraints of natural and laboratory evolution, accelerating the generation of proteins for applications in biotechnology and medicine. To make sense of the exploding diversity of machine learning approaches, we introduce a unifying framework that classifies models on the basis of their use of three core data modalities: sequences, structures and functional labels. We discuss the new capabilities and outstanding challenges for the practical design of enzymes, antibodies, vaccines, nanomachines and more. We then highlight trends shaping the future of this field, from large-scale assays to more robust benchmarks, multimodal foundation models, enhanced sampling strategies and laboratory automation.

What problem does this paper attempt to address?

This paper reviews the application and progress of machine learning in functional protein design. Breakthroughs in artificial intelligence, together with the rapid accumulation of protein sequence and structure data, have transformed computational protein design. New methods aim to escape the constraints of natural and laboratory evolution and to accelerate the generation of proteins for biotechnology and medicine.

The paper first describes the core role of machine learning in protein design: models learn complex distributions from data and can thereby approximate the functional landscape of proteins. This capability improves as the quantity and quality of training data increase, and it also depends on the inductive biases of the algorithm, that is, the assumptions or constraints encoded in the model architecture.

The paper then groups the objectives of machine learning in protein design into three categories:

1. **Enhance existing functions**: Starting from proteins that already have the desired function, mutations are introduced to improve their properties or to let them perform their original biological function under different conditions. Targets include catalytic activity, binding affinity, and thermal stability.
2. **Redesign for new functions**: Starting from existing proteins with related functions, proteins with new functions are designed. This requires either a deep understanding of the functional mechanism or extensive data relating sequence to the new function.
3. **De novo design**: Proteins are designed without a template, generating sequences with diverse three-dimensional structures and oligomeric arrangements that express as stable proteins with a high success rate.

The paper also discusses the practical uses and limitations of the various design strategies, and it proposes a unifying framework that classifies models by their use of three core data modalities: sequences, structures, and functional labels. It closes with trends shaping the future of the field, including large-scale assays, more robust benchmarks, multimodal foundation models, enhanced sampling strategies, and laboratory automation. In summary, the paper analyzes how machine learning can drive progress in functional protein design, particularly for enzymes, antibodies, and other biomedically important proteins, and it sets out both the potential of these methods and the challenges that remain.

Machine learning for functional protein design
