Abstract:In protein engineering, machine learning (ML) advancements have led to significant progress, including protein structure prediction (e.g., AlphaFold), sequence representation through language models, and novel protein generation. However, the impact of data curation on ML model performance is underexplored. As more sequence and structural data become available, a data-centric approach is increasingly favored over a model-centric method. A data-centric approach prioritizes high-quality, domain-specific data, ensuring ML tools are trained on datasets that accurately reflect biological complexity and diversity. This paper introduces a novel methodology that integrates ancestral sequence reconstruction (ASR) into ML models, enhancing data-centric strategies in the field. ASR uses computational techniques to infer ancient protein sequences from modern descendants, providing diverse, stable sequences with rich evolutionary information. While multiple sequence alignments (MSAs) are commonly used in protein engineering frameworks to incorporate evolutionary information, ASR offers deeper insights into protein evolution. Unlike MSAs, ASR captures mutation rates, phylogenic relationships, evolutionary trajectories, and specific ancestral sequences, giving access to novel protein sequences beyond what is available in public databases by natural selection. We employed two statistical methods for ASR: joint Bayesian inference and maximum likelihood. Bayesian approaches infer ancestral sequences by sampling from the entire posterior distribution, accounting for epistatic interactions between multiple amino acid positions to capture the nuances and uncertainties of ancestral sequences. In contrast, maximum likelihood methods estimate the most probable amino acids at individual positions in isolation. Both methods provide extensive ancestral data, enhancing ML model performance in protein sequence generation and fitness prediction tasks. Our results demonstrate that generative ML models training on either Bayesian or maximum likelihood approaches produce highly stable and diverse protein sequences. We also fine-tuned the evolutionary scale ESM protein language model with reconstructed ancestral data to obtain evolutionary-driven protein representations, and downstream stability prediction tasks for Endolysin and Lysozyme C families. For Lysozyme C, ancestral-based representations outperformed the baseline ESM in KNN classification and matched the established InterPro method. In Endolysin, our novel ASR-Dist method performed on par with or better than the baseline and other fine-tuning approaches across various classification metrics. ASR-Dist showed consistent performance in both simple and complex classification models, suggesting the effectiveness of this data-centric approach in enhancing protein representations. This work demonstrates how evolutionary data can improve ML-driven protein engineering, presenting a novel data-centric approach that expands our exploration of protein sequence space and enhances our ability to predict and design functional proteins.

Evolutionary context-integrated deep sequence modeling for protein engineering

ECNet is an evolutionary context-integrated deep learning framework for protein engineering

Global-Context Aware Generative Protein Design

Generative De Novo Protein Design with Global Context

Active Finetuning Protein Language Model: A Budget-Friendly Method for Directed Evolution

Knowledge-aware Reinforced Language Models for Protein Directed Evolution

Machine-learning-guided directed evolution for protein engineering

Modeling the language of life – Deep Learning Protein Sequences

Protein Design by Directed Evolution Guided by Large Language Models

Context-aware geometric deep learning for protein sequence design

Adaptive machine learning for protein engineering

Machine learning-guided directed evolution for protein engineering

EvoSeq-ML: Advancing Data-Centric Machine Learning with Evolutionary-Informed Protein Sequence Representation and Generation

Machine learning to predict continuous protein properties from binary cell sorting data and map unseen sequence space

Proximal Exploration for Model-guided Protein Sequence Design

Latent-based Directed Evolution accelerated by Gradient Ascent for Protein Sequence Design

Ensemble Learning with Supervised Methods Based on Large-Scale Protein Language Models for Protein Mutation Effects Prediction

Learning protein fitness landscapes with deep mutational scanning data from multiple sources

Protein Language Models in Directed Evolution

Biophysics-based protein language models for protein engineering