Abstract: Advances in machine learning (ML) and the availability of protein sequences via high-throughput sequencing techniques have transformed the ability to design novel diagnostic and therapeutic proteins. ML allows protein engineers to capture complex trends hidden within protein sequences that would otherwise be difficult to identify in the context of the immense and rugged protein fitness landscape. Despite this potential, there persists a need for guidance during the training and evaluation of ML methods over sequencing data. Two key challenges in training discriminative models and evaluating their performance are handling severely imbalanced datasets (e.g., few high-fitness proteins among an abundance of non-functional proteins) and selecting appropriate protein sequence representations (numerical encodings). Here, we present a framework for applying ML to assay-labeled datasets to assess how well sampling techniques and protein encoding methods improve binding affinity and thermal stability prediction. For protein sequence representations, we incorporate two widely used methods (One-Hot encoding and physicochemical encoding) and two language-based methods (next-token prediction, UniRep; masked-token prediction, ESM). We analyze performance with respect to protein fitness, protein size, and sampling technique. In addition, we build an ensemble of protein representation methods to assess the contribution of distinct representations and to improve the final prediction score. We then apply multiple criteria decision analysis (MCDA; TOPSIS with entropy weighting), using multiple metrics well suited to imbalanced data, to ensure statistical rigor in ranking our methods. Within the context of these datasets, the synthetic minority oversampling technique (SMOTE) outperformed undersampling when sequences were encoded with One-Hot, UniRep, and ESM representations. Moreover, ensemble learning increased predictive performance on the affinity-based dataset by 4% compared to the best single-encoding candidate (F1-score = 97%), while ESM alone was sufficient for stability prediction (F1-score = 92%).
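
The ranking step named in the abstract (TOPSIS with entropy weighting) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the method labels, and the metric values below are hypothetical, and the sketch assumes every criterion is a benefit-type metric (higher is better), that all values are positive, and that at least one criterion differs between methods.

```python
import math

def entropy_weights(matrix):
    """Entropy weighting: criteria that vary more across alternatives get larger weights."""
    n = len(matrix)       # number of alternatives (rows)
    m = len(matrix[0])    # number of criteria (columns)
    k = 1.0 / math.log(n)
    diversities = []
    for j in range(m):
        col = [row[j] for row in matrix]
        total = sum(col)
        probs = [v / total for v in col]
        entropy = -k * sum(p * math.log(p) for p in probs if p > 0)
        diversities.append(1.0 - entropy)
    d_sum = sum(diversities)
    return [d / d_sum for d in diversities]

def topsis(matrix, weights):
    """Closeness-to-ideal score per alternative, assuming benefit-type criteria."""
    m = len(matrix[0])
    # Vector-normalize each column, then apply the criterion weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(m)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(m)] for row in matrix]
    ideal = [max(row[j] for row in weighted) for j in range(m)]
    anti_ideal = [min(row[j] for row in weighted) for j in range(m)]
    scores = []
    for row in weighted:
        d_pos = math.dist(row, ideal)       # distance to the best value on every criterion
        d_neg = math.dist(row, anti_ideal)  # distance to the worst value on every criterion
        scores.append(d_neg / (d_pos + d_neg))
    return scores

# Hypothetical metric table (made-up numbers, not results from the paper):
# rows are candidate methods, columns are three imbalance-aware metrics.
methods = ["method_a", "method_b", "method_c"]
metric_table = [
    [0.90, 0.80, 0.85],
    [0.70, 0.60, 0.65],
    [0.80, 0.75, 0.70],
]

weights = entropy_weights(metric_table)
scores = topsis(metric_table, weights)
ranking = [name for _, name in sorted(zip(scores, methods), reverse=True)]
print(ranking)
```

In this toy table one method is best on every metric and another is worst on every metric, so they sit at the two ends of the ranking; the entropy weights only affect where intermediate methods fall.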

Enhancing the efficiency of protein language models with minimal wet-lab data through few-shot learning

Active Finetuning Protein Language Model: A Budget-Friendly Method for Directed Evolution

Likelihood-based fine-tuning of protein language models for few-shot fitness prediction and design

SESNet: Sequence-Structure Feature-Integrated Deep Learning Method for Data-Efficient Protein Engineering

Learning protein fitness landscapes with deep mutational scanning data from multiple sources

Protein Language Model Fitness Is a Matter of Preference

Contrastive Fitness Learning: Reprogramming Protein Language Models for Low-N Learning of Protein Fitness Landscape

Accelerating protein engineering with fitness landscape modeling and reinforcement learning

A protein fitness predictive framework based on feature combination and intelligent searching

Multi-Scale Representation Learning for Protein Fitness Prediction

Protein Fitness Prediction Is Impacted by the Interplay of Language Models, Ensemble Learning, and Sampling Methods

Retrieval-Enhanced Mutation Mastery: Augmenting Zero-Shot Prediction of Protein Language Model

Protein Engineering with Lightweight Graph Denoising Neural Networks

Efficient Inference, Training, and Fine-tuning of Protein Language Models

Metalic: Meta-Learning In-Context with Protein Language Models

Improving few-shot learning-based protein engineering with evolutionary sampling

Parameter-efficient fine-tuning on large protein language models improves signal peptide prediction

Fine-tuning protein language models boosts predictions across diverse tasks

Protein Language Models in Directed Evolution

Learning protein fitness models from evolutionary and assay-labeled data