Abstract: Machine learning (ML) has driven significant progress in protein engineering, including protein structure prediction (e.g., AlphaFold), sequence representation through language models, and novel protein generation. However, the impact of data curation on ML model performance remains underexplored. As more sequence and structural data become available, data-centric approaches are increasingly favored over model-centric ones. A data-centric approach prioritizes high-quality, domain-specific data, ensuring that ML tools are trained on datasets that accurately reflect biological complexity and diversity. This paper introduces a novel methodology that integrates ancestral sequence reconstruction (ASR) into ML models, strengthening data-centric strategies in the field. ASR uses computational techniques to infer ancient protein sequences from their modern descendants, providing diverse, stable sequences rich in evolutionary information. While multiple sequence alignments (MSAs) are commonly used in protein engineering frameworks to incorporate evolutionary information, ASR offers deeper insight into protein evolution. Unlike MSAs, ASR captures mutation rates, phylogenetic relationships, evolutionary trajectories, and specific ancestral sequences, giving access to novel protein sequences beyond the naturally selected ones available in public databases. We employed two statistical methods for ASR: joint Bayesian inference and maximum likelihood. Bayesian approaches infer ancestral sequences by sampling from the entire posterior distribution, accounting for epistatic interactions between multiple amino acid positions and thereby capturing the nuances and uncertainties of ancestral sequences. In contrast, maximum likelihood methods estimate the most probable amino acid at each position in isolation. Both methods provide extensive ancestral data, enhancing ML model performance in protein sequence generation and fitness prediction tasks.
Our results demonstrate that generative ML models trained on sequences reconstructed with either the Bayesian or the maximum likelihood approach produce highly stable and diverse protein sequences. We also fine-tuned the Evolutionary Scale Modeling (ESM) protein language model with reconstructed ancestral data to obtain evolution-driven protein representations, and evaluated them on downstream stability prediction tasks for the Endolysin and Lysozyme C families. For Lysozyme C, ancestral-based representations outperformed the baseline ESM in KNN classification and matched the established InterPro method. For Endolysin, our novel ASR-Dist method performed on par with or better than the baseline and other fine-tuning approaches across various classification metrics. ASR-Dist showed consistent performance in both simple and complex classification models, suggesting the effectiveness of this data-centric approach in enhancing protein representations. This work demonstrates how evolutionary data can improve ML-driven protein engineering, presenting a novel data-centric approach that expands our exploration of protein sequence space and enhances our ability to predict and design functional proteins.
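
The contrast between sampling from a joint posterior and choosing the most probable amino acid at each position in isolation can be illustrated with a toy example. The sketch below is not the reconstruction pipeline used in this work. It assumes a made-up joint posterior over two coupled sites at a single ancestral node, and every name in it (`JOINT_POSTERIOR`, `site_marginals`, `per_site_argmax`, `sample_joint`) is hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical joint posterior over two coupled sites at one ancestral node.
# The probabilities are invented for illustration and sum to 1.
JOINT_POSTERIOR = {"AL": 0.40, "VI": 0.35, "VL": 0.20, "AI": 0.05}


def site_marginals(joint):
    """Collapse the joint distribution into one marginal per site."""
    n_sites = len(next(iter(joint)))
    marginals = [defaultdict(float) for _ in range(n_sites)]
    for seq, prob in joint.items():
        for i, aa in enumerate(seq):
            marginals[i][aa] += prob
    return marginals


def per_site_argmax(joint):
    """Pick the most probable residue at each site in isolation."""
    return "".join(max(m, key=m.get) for m in site_marginals(joint))


def sample_joint(joint, n, seed=0):
    """Draw n whole sequences from the joint posterior."""
    rng = random.Random(seed)
    seqs = list(joint)
    weights = [joint[s] for s in seqs]
    return rng.choices(seqs, weights=weights, k=n)


if __name__ == "__main__":
    print("per-site choice:", per_site_argmax(JOINT_POSTERIOR))
    print("joint samples:  ", sample_joint(JOINT_POSTERIOR, 5))
```

In this toy distribution the per-site choice is `VL`, which has joint probability 0.20 and is therefore not the most probable pair (`AL`, 0.40). Sampling whole sequences keeps the coupling between sites and returns a varied set of candidates instead of a single sequence.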

EvoSeq-ML: Advancing Data-Centric Machine Learning with Evolutionary-Informed Protein Sequence Representation and Generation
