Abstract

Motivation: Recent advancements in natural language processing have highlighted the effectiveness of global contextualized representations from Protein Language Models (pLMs) in numerous downstream tasks. Nonetheless, strategies for encoding the site-of-interest with pLMs in per-residue prediction tasks, such as crotonylation (Kcr) prediction, remain largely unexplored.

Results: Herein, we explore a range of approaches for utilizing pLMs by experimenting with different input sequence types (full-length protein sequence versus window sequence) and by assessing the implications of using the per-residue embedding of the site-of-interest as well as the embeddings of the window residues centered around it. Building upon these insights, we developed a novel residual ConvBiLSTM network designed to process window-level embeddings of the site-of-interest generated by the ProtT5-XL-UniRef50 pLM using full-length sequences as input. This model, termed T5ResConvBiLSTM, surpasses existing state-of-the-art Kcr predictors in performance across three diverse datasets. To validate our approach of utilizing full sequence-based window-level embeddings, we also examined the interpretability of ProtT5-derived embedding tensors in two ways: first, by scrutinizing the attention weights obtained from the transformer's encoder block; and second, by computing SHAP values for these tensors, providing a model-agnostic interpretation of the prediction results. Additionally, we enhance the latent representation of ProtT5 by incorporating two additional local representations, one derived from amino acid properties and the other from a supervised embedding layer, through an intermediate-fusion stacked generalization approach, using an n-mer window sequence (or peptide fragment). The resultant stacked model, dubbed LMCrot, exhibits a more pronounced improvement in predictive performance across the tested datasets.

Availability and implementation: LMCrot is publicly available at https://github.com/KCLabMTU/LMCrot.
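
The idea of a full sequence-based window-level embedding can be illustrated with a minimal sketch. It assumes that per-residue vectors have already been computed for the full-length protein, and it only shows how a fixed-size window centered on each lysine might be sliced out of them. The helper names (`toy_residue_embeddings`, `lysine_sites`, `window_embedding`), the flank size of 3, and the zero-padding at the termini are illustrative assumptions, not details taken from the LMCrot implementation. The toy embedding function is a stand-in for ProtT5-XL-UniRef50, which would return one 1024-dimensional vector per residue.

```python
from typing import List

Vector = List[float]


def toy_residue_embeddings(sequence: str, dim: int = 4) -> List[Vector]:
    """Hypothetical stand-in for a pLM: one `dim`-sized vector per residue.

    A real pipeline would pass the full-length sequence through the pLM
    encoder and keep its per-residue output instead of these toy numbers.
    """
    return [
        [float(ord(aa) + i + j) for j in range(dim)]
        for i, aa in enumerate(sequence)
    ]


def lysine_sites(sequence: str) -> List[int]:
    """Return 0-based positions of lysine (K), the candidate Kcr sites."""
    return [i for i, aa in enumerate(sequence) if aa == "K"]


def window_embedding(per_residue: List[Vector], site: int, flank: int) -> List[Vector]:
    """Slice the (2 * flank + 1) rows centered on `site`.

    Positions that fall outside the protein are filled with zero vectors
    (an assumed padding scheme), so every window has the same shape.
    """
    dim = len(per_residue[0])
    window = []
    for pos in range(site - flank, site + flank + 1):
        if 0 <= pos < len(per_residue):
            window.append(list(per_residue[pos]))
        else:
            window.append([0.0] * dim)
    return window


if __name__ == "__main__":
    protein = "MKTAYIAKQR"  # made-up sequence for illustration only
    embeddings = toy_residue_embeddings(protein)  # computed once, full length
    for site in lysine_sites(protein):
        window = window_embedding(embeddings, site, flank=3)
        print(site, len(window), len(window[0]))
```

Because the embeddings are computed once from the full-length sequence, each row in a window still carries context from residues outside that window. This is the distinction the abstract draws against feeding only the window sequence to the pLM.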

LMCrot: An enhanced protein crotonylation site predictor by leveraging an interpretable window-level embedding from a transformer-based protein language model
