Abstract: Advances in machine learning (ML) and the availability of protein sequences via high-throughput sequencing techniques have transformed the ability to design novel diagnostic and therapeutic proteins. ML allows protein engineers to capture complex trends hidden within protein sequences that would otherwise be difficult to identify in the context of the immense and rugged protein fitness landscape. Despite this potential, guidance is still needed for training and evaluating ML methods on sequencing data. Two key challenges in training discriminative models and evaluating their performance are handling severely imbalanced datasets (e.g., few high-fitness proteins among an abundance of non-functional proteins) and selecting appropriate protein sequence representations (numerical encodings). Here, we present a framework for applying ML to assay-labeled datasets to elucidate the capacity of sampling techniques and protein encoding methods to improve binding affinity and thermal stability prediction. For protein sequence representations, we incorporate two widely used methods (One-Hot encoding and physicochemical encoding) and two language-based methods (next-token prediction, UniRep; masked-token prediction, ESM). We analyze performance with respect to protein fitness, protein size, and sampling technique. In addition, an ensemble of protein representation methods is generated to assess the contribution of distinct representations and improve the final prediction score. We then implement multiple criteria decision analysis (MCDA; TOPSIS with entropy weighting), using multiple metrics well-suited for imbalanced data, to ensure statistical rigor in ranking our methods. Within the context of these datasets, the synthetic minority oversampling technique (SMOTE) outperformed undersampling when sequences were encoded with One-Hot, UniRep, and ESM representations. Moreover, ensemble learning increased predictive performance on the affinity-based dataset by 4% compared to the best single-encoding candidate (F1-score = 97%), while ESM alone performed well enough for stability prediction (F1-score = 92%).
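
The ranking step named above (TOPSIS with entropy weighting) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `entropy_weights` and `topsis`, the three candidate methods, and every metric value in `matrix` are hypothetical and chosen only for illustration. The sketch assumes strictly positive metric values and that all metrics are "higher is better".

```python
import math

def entropy_weights(matrix):
    """Entropy weights for a decision matrix (rows = candidates, columns = metrics).

    Metrics whose values differ more across candidates receive larger weights.
    Assumes all entries are strictly positive.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    k = 1.0 / math.log(n_rows)
    divergences = []
    for j in range(n_cols):
        col = [row[j] for row in matrix]
        total = sum(col)
        probs = [v / total for v in col]
        entropy = -k * sum(p * math.log(p) for p in probs if p > 0)
        divergences.append(1.0 - entropy)
    total_div = sum(divergences)
    if total_div <= 1e-12:
        # Every metric is (numerically) uninformative: fall back to equal weights.
        return [1.0 / n_cols] * n_cols
    return [d / total_div for d in divergences]

def topsis(matrix, weights, benefit):
    """Closeness of each candidate to the ideal solution (1 = best, 0 = worst).

    benefit[j] is True when a higher value of metric j is better.
    Assumes no metric column is all zeros and candidates are not all identical.
    """
    n_cols = len(matrix[0])
    # Vector-normalize each column, then apply the weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_cols)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_cols)]
                for row in matrix]
    columns = list(zip(*weighted))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(columns)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(columns)]
    scores = []
    for row in weighted:
        d_pos = math.dist(row, ideal)  # distance to the ideal solution
        d_neg = math.dist(row, anti)   # distance to the anti-ideal solution
        scores.append(d_neg / (d_pos + d_neg))
    return scores

# Hypothetical values (NOT results from the paper): three candidate methods
# scored on three metrics suited to imbalanced data (e.g. F1, MCC, balanced accuracy).
matrix = [
    [0.90, 0.80, 0.88],  # candidate A
    [0.85, 0.70, 0.84],  # candidate B
    [0.60, 0.40, 0.65],  # candidate C
]
weights = entropy_weights(matrix)
scores = topsis(matrix, weights, benefit=[True, True, True])
# Candidate indices ordered from best to worst closeness score.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Entropy weighting derives the weights from the data rather than from a user's preference, so a metric on which all candidates score nearly the same contributes little to the final ranking.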

Protein Fitness Prediction Is Impacted by the Interplay of Language Models, Ensemble Learning, and Sampling Methods

Prediction of Protein‒DNA Interface Hot Spots Based on Empirical Mode Decomposition and Machine Learning

A two-step ensemble learning for predicting protein hot spot residues from whole protein sequence

Predicting Hot Spots Using a Deep Neural Network Approach

Prediction Of Protein-Protein Interactions Using Subcellular And Functional Localizations

Hybrid protein-ligand binding residue prediction with protein language models: Does the structure matter?

mCSM-PPI2: predicting the effects of mutations on protein–protein interactions

A survey on computational models for predicting protein–protein interactions