Abstract: In this paper, we present peptide-pair encoding (PPE), a general-purpose probabilistic segmentation of protein sequences into commonly occurring variable-length sub-sequences. PPE segmentation is inspired by the byte-pair encoding (BPE) text compression algorithm, which has recently gained popularity in subword neural machine translation. We modify this algorithm by adding a sampling framework that allows for multiple ways of segmenting a sequence. PPE segmentation steps can be learned over a large set of protein sequences (Swiss-Prot) or even a domain-specific dataset and then applied to a set of unseen sequences. This representation can be used as the input to any downstream machine learning task in protein bioinformatics. Here, we introduce this representation through two applications: protein motif discovery and protein sequence embedding. (i) DiMotif: we present DiMotif as an alignment-free discriminative motif discovery method and evaluate it for finding protein motifs in three different settings: (1) a comparison of DiMotif with two existing approaches on 20 distinct motif discovery problems that are experimentally verified, (2) a classification-based evaluation of the motifs extracted for integrins, integrin-binding proteins, and biofilm formation, and (3) sequence pattern searching for nuclear localization signals. In general, DiMotif obtained high recall scores while achieving an F1 score comparable to that of the other methods in the discovery of experimentally verified motifs. The high recall suggests that DiMotif can be used to create short-lists for further experimental investigation of motifs. In the classification-based evaluation, the extracted motifs could reliably detect integrins, integrin-binding proteins, and biofilm formation-related proteins on a reserved set of sequences with high F1 scores.
(ii) ProtVecX: we extend the k-mer based protein vector (ProtVec) embedding to a variable-length protein embedding using PPE sub-sequences. We show that the new embedding method can marginally outperform ProtVec in enzyme prediction as well as toxin prediction tasks. In addition, we conclude that the embeddings are beneficial in protein classification tasks when they are combined with raw amino acid k-mer features.
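
To make the idea of learning segmentation steps on one set of sequences and applying them to unseen ones more concrete, the following is a minimal sketch of plain BPE-style merging over amino-acid sequences. It is not the authors' implementation: the function names (`learn_merges`, `merge_pair`, `segment`), the toy training sequences, and the tie-breaking rule are all assumptions made for illustration, and the sampling framework that makes PPE probabilistic is not reproduced here.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(sequences, num_merges):
    """Learn an ordered list of merge operations (hypothetical helper).

    Starts from single residues and repeatedly merges the most frequent
    adjacent pair, as in byte-pair encoding.
    """
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Assumption: ties are broken lexicographically to keep the run deterministic.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        corpus = [merge_pair(tokens, best) for tokens in corpus]
    return merges


def segment(sequence, merges):
    """Apply learned merges, in order, to a possibly unseen sequence."""
    tokens = list(sequence)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


# Toy training set (illustrative only, not real Swiss-Prot data).
merges = learn_merges(["KVKVKV", "AKVA"], num_merges=2)
print(merges)                      # [('K', 'V'), ('KV', 'KV')]
print(segment("MKVKVL", merges))   # ['M', 'KVKV', 'L']
```

In this sketch the number of merge steps controls how coarse the segmentation is; segmenting the same sequence with merge lists of different lengths is one simple way to obtain several alternative segmentations, although whether this matches the sampling scheme used in PPE is an assumption.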

A NEW ENCODING SCHEME FOR PROTEIN STRUCTURE AND ITS APPLICATION

Identifying Protein Structural Classes by A Fusion Sequence Encoding Scheme

A new encoding scheme to improve the performance of protein structural class prediction

Mining Protein Sequence Motifs Representing Common 3D Structures.

Encoding Based on Grouped Weight for Protein Sequence and Its Application to Structural Class Prediction

Visualized Analysis of Motifs in Tetra-Peptide Conformational Space

A Framework for Direct Locating and Conformational Sampling of Protein Structural Motifs.

Influence Of Encoding Scheme On Protein Secondary Structure Prediction

Applications of Graph Theory in Protein Structure Identification

Methods for optimizing the structure alphabet sequences of proteins

Protein 3d Features and Surface Modeling Research

Fik Model: Novel Efficient Granular Computing Model for Protein Sequence Motifs and Structure Information Discovery

VISUALIZATION OF CONFORMATIONAL SPACE OF TRI-PEPTIDE AND TETRA-PEPTIDE

Decoding the Structural Keywords in Protein Structure Universe

Amino Acid Encoding Methods for Protein Sequences: A Comprehensive Review and Assessment

Projecting Three-dimensional Protein Structure into a One-dimensional Character Code Utilizing the Automated Protein Structure Analysis Method

ProTokens: A Machine-Learned Language for Compact and Informative Encoding of Protein 3D Structures

A Novel 2d Graphical Representation of Protein Sequence Based on Individual Amino Acid

Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX)

A new graphical representation of protein sequences and its applications

A Seqlet-Based Maximum Entropy Markov Approach for Protein Secondary Structure Prediction